Re: [zfs-discuss] partitioned cache devices

2013-03-16 Thread Richard Elling

On Mar 16, 2013, at 7:01 PM, Andrew Werchowiecki 
andrew.werchowie...@xpanse.com.au wrote:

 It's a home setup, the performance penalty from splitting the cache devices 
 is non-existent, and that workaround sounds like a pretty crazy amount of 
 overhead where I could instead just have a mirrored slog.
 
 I'm less concerned about wasted space, more concerned about the number of SAS 
 ports I have available.
 
 I understand that p0 refers to the whole disk... in the logs I pasted in I'm 
 not attempting to mount p0. I'm trying to work out why I'm getting an error 
 attempting to mount p2, after p1 has successfully mounted. Further, this has 
 been done before on other systems in the same hardware configuration in the 
 exact same fashion, and I've gone over the steps trying to make sure I 
 haven't missed something but can't see a fault. 

You can have only one Solaris partition at a time. Ian already shared the 
answer: create one 100% Solaris partition and then use format to create two 
slices.
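
For reference, a minimal sketch of that workflow, assuming the SSD is c25t10d1
(the device named in the error above) and that an 8 GB slog slice plus a large
cache slice are wanted; the slice numbers and sizes here are illustrative:

    # create a single Solaris fdisk partition spanning the whole disk
    fdisk -B /dev/rdsk/c25t10d1p0

    # in format(1M), use the "partition" menu to carve two slices,
    # e.g. s0 = 8 GB for the slog and s1 = the remainder for L2ARC,
    # then label the disk
    format -d c25t10d1

    # add the slices (not the fdisk p* partitions) to the pool;
    # the slog could instead be mirrored against a slice on the second SSD
    zpool add aggr0 log c25t10d1s0
    zpool add aggr0 cache c25t10d1s1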
 -- richard

 
 I'm not keen on using Solaris slices because I don't have an understanding of 
 what that does to the pool's OS interoperability. 
 
 From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
 [opensolarisisdeadlongliveopensola...@nedharvey.com]
 Sent: Friday, 15 March 2013 8:44 PM
 To: Andrew Werchowiecki; zfs-discuss@opensolaris.org
 Subject: RE: partitioned cache devices
 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Andrew Werchowiecki
 
 muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
 Password:
 cannot open '/dev/dsk/c25t10d1p2': I/O error
 muslimwookie@Pyzee:~$
 
 I have two SSDs in the system. I've created an 8GB partition on each drive for
 use as a mirrored write log (slog). I also have the remainder of each drive
 partitioned for use as the read cache. However, when attempting to add
 it I get the error above.
 
 Sounds like you're probably running into confusion about how to partition the 
 drive.  If you create fdisk partitions, they will be accessible as p0, p1, 
 p2, but I think p0 unconditionally refers to the whole drive, so the first 
 partition is p1, and the second is p2.
 
 If you create one big Solaris fdisk partition and then slice it with the 
 format utility's partition menu, where s2 is typically the encompassing slice 
 and people usually use s0, s1, and s6 for actual slices, then they will be 
 accessible via s0, s1, s6
 
 Generally speaking, it's inadvisable to split the slog/cache devices anyway, 
 because:
 
 If you're splitting it, evidently you're focusing on the wasted space: 
 buying an expensive 128G device where you couldn't possibly ever use more 
 than 4G or 8G for the slog.  But that's not what you should be focusing on.  
 You should be focusing on the speed (that's why you bought it in the first 
 place).  The slog is write-only, and the cache is a mixture of read/write, 
 which should hopefully be doing more reads than writes.  But regardless of 
 your actual success with the cache device, your cache device will be busy 
 most of the time, and competing against the slog.
 
 You have a mirror, you say.  You should probably drop both the cache & log.  
 Use one whole device for the cache, use one whole device for the log.  The 
 only risk you'll run is:
 
 Since a slog is write-only (except during mount, typically at boot) it's 
 possible to have a failure mode where you think you're writing to the log, 
 but the first time you go back and read, you discover an error, and discover 
 the device has gone bad.  In other words, without ever doing any reads, you 
 might not notice when/if the device goes bad.  Fortunately, there's an easy 
 workaround.  You could periodically (say, once a month) script the removal of 
 your log device, create a junk pool, write a bunch of data to it, scrub it 
 (thus verifying it was written correctly) and in the absence of any scrub 
 errors, destroy the junk pool and re-add the device as a slog to the main 
 pool.
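
 A hedged sketch of that monthly exercise, with placeholder pool and device 
 names (tank for the main pool, c4t1d0 for the SSD being tested):

     # pull the slog out of the main pool
     zpool remove tank c4t1d0

     # build a throwaway pool on the device, write to it, and verify with a scrub
     zpool create junk c4t1d0
     dd if=/dev/urandom of=/junk/testfile bs=1M count=2048
     zpool scrub junk
     zpool status junk      # look for scrub errors before trusting the device

     # tear down the test pool and put the device back as a slog
     zpool destroy junk
     zpool add tank log c4t1d0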
 
 I've never heard of anyone actually being that paranoid, and I've never heard 
 of anyone actually experiencing the aforementioned possible undetected device 
 failure mode.  So this is all mostly theoretical.
 
 Mirroring the slog device really isn't necessary in the modern age.
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 

ZFS and performance consulting
http://www.RichardElling.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabyte pool?

2013-03-15 Thread Richard Elling
On Mar 15, 2013, at 6:09 PM, Marion Hakanson hakan...@ohsu.edu wrote:

 Greetings,
 
 Has anyone out there built a 1-petabyte pool?

Yes, I've done quite a few.

  I've been asked to look
 into this, and was told low performance is fine, workload is likely
 to be write-once, read-occasionally, archive storage of gene sequencing
 data.  Probably a single 10Gbit NIC for connectivity is sufficient.
 
 We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
 using 4TB nearline SAS drives, giving over 100TB usable space (raidz3).
 Back-of-the-envelope might suggest stacking up eight to ten of those,
 depending if you want a raw marketing petabyte, or a proper power-of-two
 usable petabyte.

Yes. NB, for the PHB, using 2^N is found 2B less effective than 10^N.

 I get a little nervous at the thought of hooking all that up to a single
 server, and am a little vague on how much RAM would be advisable, other
 than as much as will fit (:-).  Then again, I've been waiting for
 something like pNFS/NFSv4.1 to be usable for gluing together multiple
 NFS servers into a single global namespace, without any sign of that
 happening anytime soon.

NFS v4 or DFS (or even a clever sysadmin + automount) offers a single namespace
without needing the complexity of NFSv4.1, lustre, glusterfs, etc.

 
 So, has anyone done this?  Or come close to it?  Thoughts, even if you
 haven't done it yourself?

Don't forget about backups :-)
 -- richard


--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-02-26 Thread Richard Elling
On Feb 26, 2013, at 12:33 AM, Tiernan OToole lsmart...@gmail.com wrote:

 Thanks all! I will check out FreeNAS and see what it can do... I will also 
 check my RAID Card and see if it can work with JBOD... fingers crossed... The 
 machine has a couple internal SATA ports (think there are 2, could be 4) so i 
 was thinking of using those for boot disks and SSDs later... 
 
 As a follow-up question: data deduplication: the machine, to start, will have 
 about 5GB of RAM. I read somewhere that 20TB of storage would require about 8GB 
 of RAM, depending on block size... Since I don't know block sizes yet (I store a 
 mix of VMs, TV shows, movies and backups on the NAS)

Consider using different policies for different data. For traditional file 
systems, you had relatively few policy options: readonly, nosuid, quota, etc. 
With ZFS, dedup and compression are also policy options. In your case, dedup 
for your media is not likely to be a good policy, but dedup for your backups 
could be a win (unless you're using something that already doesn't back up 
duplicate data -- e.g. most backup utilities). A way to approach this is to 
think of your directory structure and create file systems to match the 
policies. For example:
/home/richard = compressed (default top-level, since properties are inherited)
/home/richard/media = compressed
/home/richard/backup = compressed + dedup
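
A sketch of how that layout translates into commands, assuming the file 
systems live in a pool named tank; the property values are illustrative:

    # top-level file system: compression on, inherited by its children
    # (-p creates any missing ancestors)
    zfs create -p -o compression=on tank/home/richard

    # media inherits compression; nothing extra to set
    zfs create tank/home/richard/media

    # backups get dedup in addition to the inherited compression
    zfs create -o dedup=on tank/home/richard/backup

    # verify what each dataset ended up with
    zfs get -r compression,dedup tank/home/richard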

 -- richard

 I am not sure how much memory I will need (my estimate is 10TB raw (8TB 
 usable?) in a RAIDZ1 pool, and then 3TB raw in a striped pool). If I don't 
 have enough memory now, can I enable dedup at a later stage when I add 
 memory? Also, if I pick FreeBSD now, and want to move to, say, Nexenta, is 
 that possible? Assuming the drives are just JBOD drives (to be confirmed), 
 could they just get imported?
 
 Thanks.
 
 
 On Mon, Feb 25, 2013 at 6:11 PM, Tim Cook t...@cook.ms wrote:
 
 
 
 On Mon, Feb 25, 2013 at 7:57 AM, Volker A. Brandt v...@bb-c.de wrote:
 Tim Cook writes:
   I need something that will allow me to share files over SMB (3 if
   possible), NFS, AFP (for Time Machine) and iSCSI. Ideally, i would
   like something i can manage easily and something that works with
   the Dell...
 
  All of them should provide the basic functionality you're looking
  for.
   None of them will provide SMB3 (at all) or AFP (without a third
  party package).
 
 FreeNAS has AFP built-in, including a Time Machine discovery method.
 
 The latest FreeNAS is still based on Samba 3.x, but they are aware
 of 4.x and will probably integrate it at some point in the future.
 Then you should have SMB3.  I don't know how far along they are...
 
 
 Best regards -- Volker
 
 
 
 FreeNAS comes with a package pre-installed to add AFP support.  There is no 
 native AFP support in FreeBSD and by association FreeNAS.  
 
 --Tim
  
 
 
 
 -- 
 Tiernan O'Toole
 blog.lotas-smartman.net
 www.geekphotographer.com
 www.tiernanotoole.ie
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot destroy, volume is busy

2013-02-21 Thread Richard Elling

On Feb 21, 2013, at 8:02 AM, John D Groenveld jdg...@elvis.arl.psu.edu wrote:

 # zfs list -t vol
 NAME          USED  AVAIL  REFER  MOUNTPOINT
 rpool/dump    4.00G  99.9G  4.00G  -
 rpool/foo128  66.2M   100G    16K  -
 rpool/swap    4.00G  99.9G  4.00G  -
 
 # zfs destroy rpool/foo128
 cannot destroy 'rpool/foo128': volume is busy
 
 I checked that the volume is not a dump or swap device
 and that iSCSI is disabled.

The iSCSI service is not STMF. STMF will need to be disabled, or the volume no 
longer used by STMF.

iSCSI service is svc:/network/iscsi/target:default
STMF service is svc:/system/stmf:default


 
 On Solaris 11.1, how would I determine what's busying it?

One would think that fuser would work, but in my experience, fuser rarely does
what I expect.

If you suspect STMF, then try
stmfadm list-lu -v
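
A hedged sketch of tracking it down, assuming the zvol is exported as an STMF 
logical unit; the GUID below is a placeholder for whatever list-lu reports:

    # is STMF running?
    svcs svc:/system/stmf:default

    # find the LU whose "Data File" is /dev/zvol/rdsk/rpool/foo128
    stmfadm list-lu -v

    # remove that LU, then the destroy should succeed
    stmfadm delete-lu 600144f0...
    zfs destroy rpool/foo128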

 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?

2013-02-20 Thread Richard Elling
On Feb 20, 2013, at 2:49 PM, Markus Grundmann mar...@freebsduser.eu wrote:

 Hi!
 
 My name is Markus and I live in Germany. I'm new to this list and I have a 
 simple question
 related to ZFS. My favorite operating system is FreeBSD and I'm very happy to 
 use ZFS on it.
 
 Is it possible to enhance the properties in the current source tree with an 
 entry like protected?
 It seems not to be difficult, but I'm not a professional C programmer. 
 For more information
 please take a little bit of time and read my short post at
 
 http://forums.freebsd.org/showthread.php?t=37895
 
 I have reviewed some pieces of the source code in FreeBSD 9.1 to find out how 
 difficult it would be to
 add a pool/filesystem property as an additional security layer for 
 administrators.
 
 Whenever I modify ZFS pools or filesystems it's possible to destroy [on a bad 
 day :-)] my data. A new
 property protected=on|off on the pool and/or filesystem can help the 
 administrator prevent data loss
 (e.g. the zpool destroy tank or zfs destroy tank/filesystem command would be 
 rejected
 when the protected=on property is set).

Look at delegated administration (zfs allow). For example, you can delegate 
specific privileges to a user and simply not grant destroy. 
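
A minimal sketch of that kind of delegation; the user and dataset names are 
made up:

    # give user "markus" day-to-day rights on tank/data, but not destroy
    zfs allow markus create,mount,snapshot,rollback,send,receive tank/data

    # inspect the delegations currently in effect
    zfs allow tank/data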

Note: I'm only 99% sure this is implemented in FreeBSD, hopefully someone can 
verify.
 -- richard

 
 Is there anywhere on this list where this feature 
 request can be discussed/forwarded? I hope you have
 understood my post ;-)
 
 Thanks and best regards,
 Markus
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?

2013-02-20 Thread Richard Elling
On Feb 20, 2013, at 3:27 PM, Tim Cook t...@cook.ms wrote:
 On Wed, Feb 20, 2013 at 5:09 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Feb 20, 2013, at 2:49 PM, Markus Grundmann mar...@freebsduser.eu wrote:
 
 Hi!
 
 My name is Markus and I live in Germany. I'm new to this list and I have a 
 simple question
 related to ZFS. My favorite operating system is FreeBSD and I'm very happy 
 to use ZFS on it.
 
 Is it possible to enhance the properties in the current source tree with an 
 entry like protected?
 It seems not to be difficult, but I'm not a professional C 
 programmer. For more information
 please take a little bit of time and read my short post at
 
 http://forums.freebsd.org/showthread.php?t=37895
 
 I have reviewed some pieces of the source code in FreeBSD 9.1 to find out 
 how difficult it would be to
 add a pool/filesystem property as an additional security layer for 
 administrators.
 
 
 Whenever I modify ZFS pools or filesystems it's possible to destroy [on a 
 bad day :-)] my data. A new
 property protected=on|off on the pool and/or filesystem can help the 
 administrator prevent data loss
 (e.g. the zpool destroy tank or zfs destroy tank/filesystem command would 
 be rejected
 when the protected=on property is set).
 
 Look at the delegable properties (zfs allow). For example, you can delegate a 
 user to have
 specific privileges and then not allow them to destroy. 
 
 Note: I'm only 99% sure this is implemented in FreeBSD, hopefully someone can 
 verify.
  -- richard
 
 
 
 With the version of allow I'm looking at, unless I'm missing a setting, it 
 looks like it'd be a complete nightmare.  I see no concept of deny, so that 
 means you either have to give *everyone* all permissions besides delete, or 
 you have to go through every user/group on the box and give specific 
 permissions on top of not allowing destroy.  And then if you change your 
 mind later you have to go back through and give everyone you want to have 
 that feature access to it.  That seems like a complete PITA to me.  

:-) they don't call it idiot-proofing for nothing! :-)

But seriously, one of the first great zfs-discuss wars was over the request for 
a -f flag for destroy. The result of the research showed that if you typed 
destroy, then you meant it, and adding a -f flag just teaches you to type 
destroy -f instead. See also kill -9.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL

2013-02-16 Thread Richard Elling
On Feb 16, 2013, at 10:16 PM, Bryan Horstmann-Allen b...@mirrorshades.net 
wrote:

 +--
 | On 2013-02-17 18:40:47, Ian Collins wrote:
 | 
 One of its main advantages is it has been platform agnostic.  We see 
 Solaris, Illumos, BSD and more recently ZFS on Linux questions all given the 
 same respect.
 
 I do hope we can get another, platform agnostic, home for this list.
 
 As the guy who provides the illumos mailing list services, and as someone who
 has deeply vested interests in seeing ZFS thrive on all platforms, I'm happy 
 to
 suggest that we'd welcome all comers on z...@lists.illumos.org.

+1
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how to know available disk space

2013-02-07 Thread Richard Elling
On Feb 6, 2013, at 5:17 PM, Gregg Wonderly gregg...@gmail.com wrote:

 This is one of the greatest annoyances of ZFS.  I don't really understand 
 why a zvol's space cannot be accurately enumerated from top to bottom of 
 the tree in 'df' output etc.  Why does a zvol divorce the space used from 
 the root of the volume?

Thick (with reservation) or thin provisioning behave differently. Also, 
depending on how you created the reservation, it might or might not account 
for the metadata overhead needed. By default, space for metadata is reserved, 
but if you use the -s (sparse, aka thin provisioning) option, then later 
reservation changes are set as absolute.

Also, metadata space, compression, copies, and deduplication must be accounted 
for. The old notions of free/available don't match very well with these modern 
features.

 
 Gregg Wonderly
 
 On Feb 6, 2013, at 5:26 PM, Edward Ned Harvey 
 (opensolarisisdeadlongliveopensolaris) 
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 
 I have a bunch of VM's, and some samba shares, etc, on a pool.  I created 
 the VM's using zvol's, specifically so they would have an 
 appropriate refreservation and never run out of disk space, even with 
 snapshots.  Today, I ran out of disk space, and all the VM's died.  So 
 obviously it didn't work.
  
 When I used zpool status after the system crashed, I saw this:
 NAME     SIZE  ALLOC  FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
 storage  928G  568G   360G  -         61%  1.00x  ONLINE  -
  
 I did some cleanup, so I could turn things back on ... Freed up about 4G.
  
 Now, when I use zpool status I see this:
 NAME     SIZE  ALLOC  FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
 storage  928G  564G   364G  -         60%  1.00x  ONLINE  -
  
 When I use zfs list storage I see this:
 NAME  USED  AVAIL  REFER  MOUNTPOINT
 storage   909G  4.01G  32.5K  /storage
  
 So I guess the lesson is (a) refreservation and zvol alone aren't enough to 
 ensure your VM's will stay up, and (b) if you want to know how much room is 
 *actually* available, as in usable, as in, how much can I write before I 
 run out of space, you should use zfs list and not zpool status

Correct. zpool status does not show dataset space available.
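
A quick way to see the difference, using the pool name from the example above:

    # pool-wide view: raw vdev space, not reservation-aware
    zpool list storage

    # dataset view: what can actually still be written
    zfs list -o name,used,avail,refer storage

    # per-dataset breakdown of where the space went
    zfs list -r -o space storage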
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-29 Thread Richard Elling
On Jan 29, 2013, at 6:08 AM, Robert Milkowski rmilkow...@task.gda.pl wrote:

 From: Richard Elling
 Sent: 21 January 2013 03:51
 
 VAAI has 4 features, 3 of which have been in illumos for a long time. The
 remaining feature (SCSI UNMAP) was done by Nexenta and exists in their
 NexentaStor product, but the CEO made a conscious (and unpopular) decision
 to keep that code from the community. Over the summer, another developer
 picked up the work in the community, but I've lost track of the progress
 and haven't seen an RTI yet.
 
 That is one thing that always bothered me... so it is OK for others, like
 Nexenta, to keep stuff closed and not in the open, while if Oracle does it
 they are bad?

Nexenta is just as bad. For the record, the illumos-community folks who worked 
at
Nexenta at the time were overruled by executive management. Some of those folks
are now executive management elsewhere :-)

 
 Isn't that at least a little bit hypocritical? (bashing Oracle and doing
 sort of the same)

No, not at all.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Richard Elling
On Jan 20, 2013, at 8:16 AM, Edward Harvey imaginat...@nedharvey.com wrote:
 But, by talking about it, we're just smoking pipe dreams.  Cuz we all know 
 zfs is developmentally challenged now.  But one can dream...

I disagree that ZFS is developmentally challenged. There is more development
now than ever in every way: # of developers, companies, OSes, KLOCs, features.
Perhaps the level of maturity makes progress appear to move more slowly than 
it did early in its life?

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Richard Elling
On Jan 20, 2013, at 4:51 PM, Tim Cook t...@cook.ms wrote:

 On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Jan 20, 2013, at 8:16 AM, Edward Harvey imaginat...@nedharvey.com wrote:
  But, by talking about it, we're just smoking pipe dreams.  Cuz we all know 
  zfs is developmentally challenged now.  But one can dream...
 
 I disagree that ZFS is developmentally challenged. There is more development
 now than ever in every way: # of developers, companies, OSes, KLOCs, features.
 Perhaps the level of maturity makes progress appear to move more slowly than
 it did early in its life?
 
  -- richard
 
 Well, perhaps a part of it is marketing.  

A lot of it is marketing :-/

 Maturity isn't really an excuse for not having a long-term feature roadmap.  
 It seems as though maturity in this case equals stagnation.  What are the 
 features being worked on that we aren't aware of?

Most of the illumos-centric discussion is on the developers' list. The 
ZFSonLinux and BSD communities are also quite active. Almost none of the ZFS 
developers hang out on zfs-discuss@opensolaris.org anymore. In fact, I wonder 
why I'm still here...

  The big ones that come to mind that everyone else is talking about, for not 
 just ZFS but OpenIndiana as a whole and other storage platforms, would be:
 1. SMB3 - Hyper-V WILL be gaining market share over the next couple of years; 
 not supporting it means giving up a sizeable portion of the market.  Not to 
 mention finally being able to run SQL (again) and Exchange on a fileshare.

I know of at least one illumos community company working on this. However, I do 
not
know their public plans.

 2. VAAI support.  

VAAI has 4 features, 3 of which have been in illumos for a long time. The 
remaining feature (SCSI UNMAP) was done by Nexenta and exists in their 
NexentaStor product, but the CEO made a conscious (and unpopular) decision to 
keep that code from the community. Over the summer, another developer picked 
up the work in the community, but I've lost track of the progress and haven't 
seen an RTI yet.

 3. the long-sought bp-rewrite.

Go for it!

 4. full drive encryption support.

This is a key management issue mostly. Unfortunately, the open source code for
handling this (trousers) covers much more than keyed disks and can be unwieldy.
I'm not sure which distros picked up trousers, but it doesn't belong in the 
illumos-gate
and it doesn't expose itself to ZFS.

 5. tiering (although I'd argue caching is superior, it's still a checkbox).

You want to add tiering to the OS? That has been available for a long time via 
the
(defunct?) SAM-QFS project that actually delivered code
http://hub.opensolaris.org/bin/view/Project+samqfs/

If you want to add it to ZFS, that is a different conversation.
 -- richard

 
 There's obviously more, but those are just ones off the top of my head that 
 others are supporting/working on.  Again, it just feels like all the work is 
 going into fixing bugs and refining what is there, not adding new features.  
 Obviously Saso personally added features, but overall there don't seem to be 
 a ton of announcements to the list about features that have been added or are 
 being actively worked on.  It feels like all these companies are just adding 
 niche functionality they need that may or may not be getting pushed back to 
 mainline.
 
 /debbie-downer
 

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Richard Elling
On Jan 19, 2013, at 7:16 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 Oh, I forgot to mention - The above logic only makes sense for mirrors and 
 stripes.  Not for raidz (or raid-5/6/dp in general)
 
 If you have a pool of mirrors or stripes, the system isn't forced to 
 subdivide a 4k block onto multiple disks, so it works very well.  But if you 
 have a pool blocksize of 4k and let's say a 5-disk raidz (capacity of 4 
 disks) then the 4k block gets divided into 1k on each disk and 1k parity on 
 the parity disk.  Now, since the hardware only supports block sizes of 4k ... 
 You can see there's a lot of wasted space, and if you do a bunch of it, 
 you'll also have a lot of wasted time waiting for seeks/latency.

This is not quite true for raidz. If there is a 4k write to a raidz comprised 
of 4k-sector disks, then there will be one data and one parity block. There 
will not be 4 data + 1 parity with 75% space wastage. Rather, the space 
allocation more closely resembles a variant of mirroring, like what some 
vendors call RAID-1E.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Richard Elling
bloom filters are a great fit for this :-)

  -- richard



On Jan 19, 2013, at 5:59 PM, Nico Williams n...@cryptonector.com wrote:

 I've wanted a system where dedup applies only to blocks being written
 that have a good chance of being dups of others.
 
 I think one way to do this would be to keep a scalable Bloom filter
 (on disk) into which one inserts block hashes.
 
 To decide if a block needs dedup one would first check the Bloom
 filter, then if the block is in it, use the dedup code path, else the
 non-dedup codepath and insert the block in the Bloom filter.  This
 means that the filesystem would store *two* copies of any
 deduplicatious block, with one of those not being in the DDT.
 
 This would allow most writes of non-duplicate blocks to be faster than
 normal dedup writes, but still slower than normal non-dedup writes:
 the Bloom filter will add some cost.
 
 The nice thing about this is that Bloom filters can be sized to fit in
 main memory, and will be much smaller than the DDT.
 
 It's very likely that this is a bit too obvious to just work.
 
 Of course, it is easier to just use flash.  It's also easier to just
 not dedup: the most highly deduplicatious data (VM images) is
 relatively easy to manage using clones and snapshots, to a point
 anyways.
 
 Nico
 --
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-18 Thread Richard Elling

On Jan 17, 2013, at 9:35 PM, Thomas Nau thomas@uni-ulm.de wrote:

 Thanks for all the answers (more inline)
 
 On 01/18/2013 02:42 AM, Richard Elling wrote:
 On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 mailto:bfrie...@simple.dallas.tx.us wrote:
 
 On Wed, 16 Jan 2013, Thomas Nau wrote:
 
 Dear all
 I've a question concerning possible performance tuning for both iSCSI 
 access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC 
 RAM ZIL
 SSDs and 128G of main memory
 
 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 4k might be a little small. 8k will have less metadata overhead. In some 
 cases
 we've seen good performance on these workloads up through 32k. Real pain
 is felt at 128k :-)
 
 My only pain so far is the time a send/receive takes without really loading 
 the network at all. VM performance is nothing I worry about at all as it's 
 pretty good. So the key question for me is whether going from 8k to 16k or 
 even 32k would have some benefit for that problem?

send/receive can bottleneck on the receiving side. Take a look at the archives,
searching for mbuffer, as a method of buffering on the receive side. In a
well-tuned system, the send will be from ARC :-)
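
A hedged sketch of the mbuffer approach described in those archives; host
names, port, buffer sizes, and snapshot names are placeholders:

    # receiving side (start this first): buffer in front of zfs receive
    mbuffer -s 128k -m 1G -I 9090 | zfs receive -F backup/vol

    # sending side: buffer the incremental stream before it hits the network
    zfs send -i tank/vol@yesterday tank/vol@today | \
        mbuffer -s 128k -m 1G -O backuphost:9090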
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-18 Thread Richard Elling
On Jan 18, 2013, at 4:40 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-01-18 06:35, Thomas Nau wrote:
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 4k might be a little small. 8k will have less metadata overhead. In some 
 cases
 we've seen good performance on these workloads up through 32k. Real pain
 is felt at 128k :-)
 
  My only pain so far is the time a send/receive takes without really loading 
  the network at all. VM performance is nothing I worry about at all as it's 
  pretty good. So the key question for me is whether going from 8k to 16k or 
  even 32k would have some benefit for that problem?
 
 I would guess that increasing the block size would on one hand improve
 your reads - due to more userdata being stored contiguously as part of
 one ZFS block - and thus sending of the backup streams should be more
 about reading and sending the data and less about random seeking.

There is too much caching in the datapath to make a broad statement stick.
Empirical measurements with your workload will need to choose the winner.

 On the other hand, this may likely be paid off with the need to do more
 read-modify-writes (when larger ZFS blocks are partially updated with
 the smaller clusters in the VM's filesystem) while the overall system
 is running and used for its primary purpose. However, since the guest
 FS is likely to store files of non-minimal size, it is likely that the
 whole larger backend block would be updated anyway...

For many ZFS implementations, RMW for zvols is the norm.

 
 So, I think, this is something an experiment can show you - whether the
 gain during backup (and primary-job) reads vs. possible degradation
 during the primary-job writes would be worth it.
 
 As for the experiment, I guess you can always make a ZVOL with different
 recordsize, DD data into it from the production dataset's snapshot, and
 attach the VM or its clone to the newly created clone of its disk image.

In my experience, it is very hard to recreate in the lab the environments
found in real life. dd, in particular, will skew the results a bit because it
is in LBA order for zvols, not the creation order as seen in the real world.

That said, trying to get high performance out of HDDs is an exercise like
fighting the tides :-)
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Richard Elling
On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
wrote:

 On Wed, 16 Jan 2013, Thomas Nau wrote:
 
 Dear all
 I've a question concerning possible performance tuning for both iSCSI access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM 
 ZIL
 SSDs and 128G of main memory
 
 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize 
 of 4K?  This seems like the most obvious improvement.

4k might be a little small. 8k will have less metadata overhead. In some cases
we've seen good performance on these workloads up through 32k. Real pain
is felt at 128k :-)

 
 [ stuff removed ]
 
 For disaster recovery we plan to sync the pool as often as possible
 to a remote location. Running send/receive after a day or so seems to take
 a significant amount of time wading through all the blocks and we hardly
 see network average traffic going over 45MB/s (almost idle 1G link).
 So here's the question: would increasing/decreasing the volblocksize improve
 the send/receive operation and what influence might show for the iSCSI side?
 
 Matching the volume block size to what the clients are actually using (due to 
 their filesystem configuration) should improve performance during normal 
 operations and should reduce the number of blocks which need to be sent in 
 the backup by reducing write amplification due to overlap blocks.

compression is a good win, too 
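
A sketch of how both knobs are applied, with illustrative names and sizes; 
note that volblocksize is fixed at creation time, so changing it means making 
a new zvol and copying the data, while compression can be turned on at any 
time (it only affects newly written blocks):

    # new zvol with 8 KB blocks (-b sets volblocksize at creation)
    zfs create -b 8k -V 200G tank/xen-vol1

    # enable compression on the zvol, then confirm both properties
    zfs set compression=on tank/xen-vol1
    zfs get volblocksize,compression tank/xen-vol1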
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Richard Elling

On Jan 17, 2013, at 8:35 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-01-17 16:04, Bob Friesenhahn wrote:
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 Matching the volume block size to what the clients are actually using
 (due to their filesystem configuration) should improve performance
 during normal operations and should reduce the number of blocks which
 need to be sent in the backup by reducing write amplification due to
 overlap blocks..
 
 
 Also, it would make sense while you are at it to verify that the
 clients (i.e. the VMs' filesystems) do their IOs 4KB-aligned, i.e. that
 their partitions start at a 512b-based sector offset divisible by
 8 inside the virtual HDDs, and the FS headers also align to that
 so the first cluster is 4KB-aligned.
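
 One way to spot-check that from inside an illumos/Solaris guest is sketched 
 below; the disk name is a placeholder, and for other guest OSes the 
 equivalent check is that each partition's starting LBA is divisible by 8:

     # prtvtoc prints each slice's starting sector; with 512-byte sectors,
     # a start that is a multiple of 8 sits on a 4 KB boundary
     prtvtoc /dev/rdsk/c0t0d0s2 | \
         awk '!/^\*/ { print $1, $4, (($4 % 8) ? "misaligned" : "4K-aligned") }'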

This is the classical expectation. So I added an alignment check into
nfssvrtop and iscsisvrtop. I've looked at a *ton* of NFS workloads from
ESX and, believe it or not, alignment doesn't matter at all, at least for 
the data I've collected. I'll let NetApp wallow in the mire of misalignment
while I blissfully dream of other things :-)

 Classic MSDOS MBR did not guarantee that partition alignment, because it used
 63 sectors as the cylinder size and offset factor. Newer OSes don't
 use the classic layout, as any config is allowable; and GPT is well
 aligned as well.
 
 Overall, a single IO in the VM guest changing a 4KB cluster in its
 FS should translate to one 4KB IO in your backend storage changing
 the dataset's userdata (without reading a bigger block and modifying
 it with COW), plus some avalanche of metadata updates (likely with
 the COW) for ZFS's own bookkeeping.

I've never seen a 1:1 correlation from the VM guest to the workload
on the wire. To wit, I did a bunch of VDI and VDI-like (small, random
writes) testing on XenServer and while the clients were chugging
away doing 4K random I/Os, on the wire I was seeing 1MB NFS
writes. In part this led to my cars-and-trains analysis.

In some VMware configurations, over the wire you could see a 16k
read for every 4k random write. Go figure. Fortunately, those 16k 
reads find their way into the MFU side of the ARC :-)

Bottom line: use tools like iscsisvrtop and dtrace to get an idea of
what is really happening over the wire.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HP Proliant DL360 G7

2013-01-08 Thread Richard Elling
On Jan 8, 2013, at 10:30 AM, Edmund White ewwh...@mac.com wrote:

 The D2600 and D2700 enclosures are fully supported as Nexenta JBODs. I run 
 them in multiple production environments. 

Yes, I worked on the field qualifications for these… very nice JBODs :-)

 I *could* use an HP-branded LSI controller (SC08Ge), but I prefer the higher 
 performance of the LSI 9211 and 9205e HBA's.

Many of the big-box vendors have to deal with Windows as the target OS. Until 
Server 2012,
the use of JBODs with lots of disks was challenging for Windows. Hence, they 
offer few
options for the folks who want JBOD control.
 -- richard

 
 I recently posted on Server Fault with the Nexenta console representation of 
 the HP D2700 JBOD. It's already integrated with NexentaStor.
 
 -- 
 Edmund White
 ewwh...@mac.com
 
 From: Mark - carne...@gmail.com
 Date: Tuesday, January 8, 2013 12:09 PM
 To: Sašo Kiselkov skiselkov...@gmail.com
 Cc: zfs-discuss@opensolaris.org zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] HP Proliant DL360 G7
 
 Good call Saso.  Sigh... I guess I wait to hear from HP on supported IT mode 
 HBAs in their D2000s or other jbods.
 
 
 On Tue, Jan 8, 2013 at 11:40 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 On 01/08/2013 04:27 PM, mark wrote:
  On Jul 2, 2012, at 7:57 PM, Richard Elling wrote:
 
  FYI, HP also sells an 8-port IT-style HBA (SC-08Ge), but it is hard to 
  locate
  with their configurators. There might be a more modern equivalent cleverly
  hidden somewhere difficult to find.
   -- richard
 
 
  Richard,
 
  Do you know if the HBAs in HP controllers can be swapped out with any 
  well-characterized (by Nexenta) HBAs like the 9211-8e, or do they require a 
  specific 'controller HBA' like the SC-08Ge?  I.e., does it void the warranty 
  if you open up the controller and stick a third-party card in there?  Did 
  you ever try to 'bypass' the controllers at all and just plug into an 
  expander?  I prefer HP hardware also, but the controller is getting in the 
  way.
 
  I'll be asking HP the same questions in the next few weeks with any luck, 
  but your opinion and experiences are on another level compared to HP's 
  pre-sales department... not that they're bad, but in this realm you're the 
  man :)
 
 I know you didn't ask me, but I can tell you my experience: it depends
 on what you mean by warranty. If you mean as in warranty on sales of
 goods (as mandated by law), then no, sticking a different HBA in your
 servers does not void your warranty (unless this is expressly labeled on
 the product - manufacturers typically also put protective labels on
 screws then).
 
 When it comes to support services, though, such as phone support and
 firmware updates, then yes, using a third-party HBA can make these
 difficult and/or impossible. HP storage enclosure and drive firmware,
 for example, can only be flashed through an HP-branded SmartArray card.
 
 Depending on what software you are running on the machines it can make
 no difference at all, or a lot of difference. For instance, if you're
 running proprietary storage controller software on the server (think
 something like NexentaStor, but from the HW vendor), then your custom
  HBA might simply be flat out unsupported and the only response you'll
  get from the vendor support team is "stick the card we shipped it with
  back in."  OTOH if you're running something not HW vendor-specific (like
 the aforementioned NexentaStor, or any other Illumos variant), and the
 HW vendor at least gives lip service to supporting your platform (always
 tell the support folk you're running Solaris), then chances are that
 your support contract will be just as valid as before. I've had drives
  fail on Dell machines and each time support was happy when I just told
  them "drive dead, running Solaris, here's the log output, send a new one
  please."
 
 Cheers,
 --
 Saso
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 

ZFS and performance consulting
http://www.RichardElling.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mpt_sas multipath problem?

2013-01-07 Thread Richard Elling

On Jan 7, 2013, at 1:20 PM, Marion Hakanson hakan...@ohsu.edu wrote:

 Greetings,
 
 We're trying out a new JBOD here.  Multipath (mpxio) is not working,
 and we could use some feedback and/or troubleshooting advice.

Sometimes the mpxio detection doesn't work properly. You can try to
whitelist them,
https://www.illumos.org/issues/644

 -- richard

 
 The OS is oi151a7, running on an existing server with a 54TB pool
 of internal drives.  I believe the server hardware is not relevant
 to the JBOD issue, although the internal drives do appear to the
 OS with multipath device names (despite the fact that these
 internal drives are cabled up in a single-path configuration).  If
 anything, this does confirm that multipath is enabled in mpt_sas.conf
 via the mpxio-disable=no directive (internal HBA's are LSI SAS,
 2x 9201-16i and 1x 9211-8i).
 
 The JBOD is a SuperMicro 847E26-RJBOD1, with the front backplane
 daisy-chained to the rear backplane (both expanders).  Each of the two
 expander chains is connected to one port of an LSI SAS 9200-8e HBA.  So
 far, all this hardware has appeared as working for others and well-supported,
 and this 9200-8e is running the -IT firmware, version 15.0.0.0.
 
 The drives are 40x of the WD4001FYYG SAS 4TB variety, firmware VR02.
 The spot-checks I've done so far seem to show that both device instances
 of a drive show up in prtconf -Dv with identical serial numbers and
 identical devid and guid values, so I'm not sure what might be
 missing to allow mpxio to recognize them as the same device.
 
 Has anyone out there got this type of hardware working?  In a multipath
 configuration?  Suggestions on mdb or dtrace code I can use to debug?
 Are there secrets to the internal daisy-chain cabling that our vendor
 is not aware of?
 
 Thanks and regards,
 
 Marion
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool layout vs resilver times

2013-01-05 Thread Richard Elling
On Jan 5, 2013, at 9:42 AM, Russ Poyner rpoy...@engr.wisc.edu wrote:

 I'm configuring a box with 24x 3Tb consumer SATA drives, and wondering about 
 the best way to configure the pool. The customer wants capacity on the cheap, 
 and I want something I can service without sweating too much about data loss. 
 Due to capacity concerns raid 10 is probably out, which leaves various raidz 
 choices

You should consider space, data dependability as measured by Mean Time to Data 
Loss (MTTDL), and performance.

For the MTTDL[1] model, let's use 700k hours MTBF for the disks and 168 hours 
for recovery (48 hours logistical + 120 hours to resilver a full disk).

For performance, let's hope for 7,200 rpm and about 80 IOPS for small, random 
reads with 100% cache miss.

 
 1. A stripe of four 6 disk raidz2

Option 1
        space ~= 4 * (6 - 2) * 3TB = 48 TB
        MTTDL[1] = 8.38e+5 years, or 0.000119% Annualized Failure Rate (AFR)
        small, random read performance, best of worst case = 4 * (6/4) * 80 IOPS = 480 IOPS

 2. A stripe of two 11 disk raidz3 with 2 hot spares.

Option 2
        space ~= 2 * (11 - 3) * 3TB = 48 TB
        MTTDL[1] = 3.62e+7 years, or 0.03% AFR
        small, random read performance, best of worst case = 2 * (11/8) * 80 IOPS = 220 IOPS

Option 2a (no hot spares)
        space ~= 2 * (12 - 3) * 3TB = 54 TB
        MTTDL[2] = 1.90e+7 years, or 0.05% AFR
        small, random read performance, best of worst case = 2 * (12/9) * 80 IOPS = 213 IOPS

 
 Other, better ideas?

There are thousands of permutations you could consider :-)
For 24-bay systems with double parity or better, we also see a 3x8-disk as a
common configuration. Offhand, I'd say we see more 4x6-disk and 3x8-disk
configs than any configs with more than 10 disks per set.
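
For reference, the 4x6-disk raidz2 layout is a single zpool create with four 
raidz2 groups; the controller/target numbers below are placeholders:

    zpool create tank \
        raidz2 c1t0d0  c1t1d0  c1t2d0  c1t3d0  c1t4d0  c1t5d0 \
        raidz2 c1t6d0  c1t7d0  c1t8d0  c1t9d0  c1t10d0 c1t11d0 \
        raidz2 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 \
        raidz2 c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0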

 
 My questions are
 
 A. How long will resilvering take with these layouts when the disks start 
 dying?

It depends on the concurrent workload. By default resilvers are throttled and 
give way to other workloads. In general, for double- or triple-parity RAID, 
you don't need to worry too much on a per-disk basis. The conditions you need 
to worry about are where the failure cause is common to all disks, such as a 
controller, fans, cabling, or power, because those are more likely than a 
triple failure of disks (as clearly shown by the MTTDL[1] model results above).

 
 B. Should I prefer hot spares or additional parity drives, and why?

In general, additional parity is better than hot spares. You get more 
performance and better data dependability.

 
 The box is a supermicro with 36 bays controlled through a single LSI 9211-8i. 
 There is a separate intel 320 ssd for the OS. The purpose is to backup data 
 from the customer's windows workstations. I'm leaning toward using BackupPC 
 for the backups since it seems to combine good efficiency with a fairly 
 customer-friendly web interface.

Sounds like a good plan.
 -- richard

 
 I'm running FreeBSD 9, after having failed to get the plugin jail working in 
 FreeNAS; also, for personal reasons, I find csh easier to use than the FreeNAS 
 web interface. My impression is that FreeBSD combines a mature OS with the 
 second oldest/best (after Illumos) free implementation of ZFS.
 
 Thanks in advance
 Russ Poyner
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)

2013-01-04 Thread Richard Elling
On Jan 4, 2013, at 11:12 AM, Robert Milkowski rmilkow...@task.gda.pl wrote:

 
 Illumos is not so good at dealing with huge memory systems but perhaps
 it is also more stable as well.
 
 Well, I guess that it depends on your environment, but generally I would
 expect S11 to be more stable, if only because of the sheer number of bugs
 reported by paid customers and bug fixes by Oracle that Illumos is not
 getting (lack of resources, limited usage, etc.).


This is a two-edged sword. Software reliability analysis shows that the 
most reliable software is the software that is oldest and unchanged. But 
people also want new functionality. So while Oracle has more changes being 
implemented in Solaris, that is destabilizing while simultaneously improving 
reliability. Unfortunately, it is hard to get both wins. What is more likely 
is that new features are being driven into Solaris 11 that are destabilizing. 
By contrast, the number of new features being added to illumos-gate (not to 
be confused with illumos-based distros) is relatively modest, and in all 
cases they are not gratuitous.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VDI iops with caching

2013-01-04 Thread Richard Elling
On Jan 3, 2013, at 8:38 PM, Geoff Nordli geo...@gnaa.net wrote:

 Thanks Richard, Happy New Year.
 
 On 13-01-03 09:45 AM, Richard Elling wrote:
 On Jan 2, 2013, at 8:45 PM, Geoff Nordli geo...@gnaa.net wrote:
 
 I am looking at the performance numbers for the Oracle VDI admin guide.
 
 http://docs.oracle.com/html/E26214_02/performance-storage.html
 
 From my calculations for 200 desktops running Windows 7 knowledge user (15 
 iops) with a 30-70 read/write split it comes to 5100 iops. Using 7200 rpm 
 disks the requirement will be 68 disks.
 
 This doesn't seem right, because if you are using clones with caching, you 
 should be able to easily satisfy your reads from ARC and L2ARC.  As well, 
 Oracle VDI by default caches writes; therefore the writes will be coalesced 
 and there will be no ZIL activity.
 
 All of these "IOPS per VDI user" guidelines are wrong. The problem is that 
 the variability of response time is too great for an HDD. The only hope we 
 have of getting the back-of-the-napkin calculations to work is to reduce the 
 variability by using a device that is more consistent in its response (e.g. 
 SSDs).
 
 For sure there is going to be a lot of variability, but it seems we aren't 
 even close.  
 
 Have you seen any back-of-the-napkin calculations which take into 
 consideration SSDs for cache usage? 

Yes. I've written a white paper on the subject, somewhere on the nexenta.com 
website (if it is still available).
But more current info is in the presentation from ZFSday:
http://www.youtube.com/watch?v=A4yrSfaskwI
http://www.slideshare.net/relling

 
 Anyone have other guidelines on what they are seeing for iops with vdi?
 
 
 The successful VDI implementations I've seen have relatively small space 
 requirements for the performance-critical work. So there are a bunch of 
 companies offering SSD-based arrays for that market. If you're stuck with 
 HDDs, then effective use of snapshots+clones with a few GB of RAM and a slog 
 can support quite a few desktops.
  -- richard
 
 
 Yes, I would like to stick with HDDs. 
 
 I am just not quite sure what "quite a few desktops" means.  
 
 I thought for sure there would be lots of people around who have done small 
 deployments using a standard ZFS setup.  

... and large :-)  I did 100 desktops with 2 SSDs two years ago. The 
presentation was given at
OpenStorage Summit 2010. I don't think there is a video, though :-(.

Fundamentally, people like to use sizing in IOPS, but all IOPS are not created 
equal. An I/O
satisfied by ARC is often limited by network bandwidth constraints whereas an 
I/O that hits a
slow pool is often limited by HDD latency. The two are 5 orders of magnitude 
different when
using HDDs in the pool.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VDI iops with caching

2013-01-03 Thread Richard Elling
On Jan 2, 2013, at 8:45 PM, Geoff Nordli geo...@gnaa.net wrote:

 I am looking at the performance numbers for the Oracle VDI admin guide.
 
 http://docs.oracle.com/html/E26214_02/performance-storage.html
 
 From my calculations for 200 desktops running Windows 7 knowledge user (15 
 iops) with a 30-70 read/write split it comes to 5100 iops. Using 7200 rpm 
 disks the requirement will be 68 disks.
 
 This doesn't seem right, because if you are using clones with caching, you 
 should be able to easily satisfy your reads from ARC and L2ARC.  As well, 
 Oracle VDI by default caches writes; therefore the writes will be coalesced 
 and there will be no ZIL activity.

All of these "IOPS per VDI user" guidelines are wrong. The problem is that the 
variability of response time is too great for an HDD. The only hope we have of 
getting the back-of-the-napkin calculations to work is to reduce the 
variability by using a device that is more consistent in its response (e.g. 
SSDs).

 
 Anyone have other guidelines on what they are seeing for iops with vdi?
 

The successful VDI implementations I've seen have relatively small space 
requirements for the performance-critical work. So there are a bunch of 
companies offering SSD-based arrays for that market. If you're stuck with 
HDDs, then effective use of snapshots+clones with a few GB of RAM and a slog 
can support quite a few desktops.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] poor CIFS and NFS performance

2013-01-03 Thread Richard Elling

On Jan 3, 2013, at 12:33 PM, Eugen Leitl eu...@leitl.org wrote:

 On Sun, Dec 30, 2012 at 06:02:40PM +0100, Eugen Leitl wrote:
 
 Happy $holidays,
 
 I have a pool of 8x ST31000340AS on an LSI 8-port adapter as
 
 Just a little update on the home NAS project.
 
 I've set the pool sync to disabled, and added a couple
 of
 
   8. c4t1d0 ATA-INTELSSDSA2M080-02G9 cyl 11710 alt 2 hd 224 sec 56
  /pci@0,0/pci1462,7720@11/disk@1,0
   9. c4t2d0 ATA-INTELSSDSA2M080-02G9 cyl 11710 alt 2 hd 224 sec 56
  /pci@0,0/pci1462,7720@11/disk@2,0

Setting sync=disabled means your log SSDs (slogs) will not be used.
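
A quick way to check and, if desired, restore the setting so that synchronous 
writes go through the slog mirror again (pool name as in the thread):

    # what is the pool currently doing with synchronous writes?
    zfs get sync tank0

    # back to the default: honor sync writes via the ZIL/slog
    zfs set sync=standard tank0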
 -- richard

 
 I had no clue what the partition names (created with the napp-it web
 interface, a la 5% log and 95% cache, of 80 GByte) were, and so
 did an iostat -xnp
 
    1.4  0.3  5.5  0.0  0.0  0.0  0.0  0.0   0   0 c4t1d0
    0.1  0.0  3.7  0.0  0.0  0.0  0.0  0.5   0   0 c4t1d0s2
    0.1  0.0  2.6  0.0  0.0  0.0  0.0  0.5   0   0 c4t1d0s8
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.2   0   0 c4t1d0p0
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0   0 c4t1d0p1
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0   0 c4t1d0p2
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0   0 c4t1d0p3
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0   0 c4t1d0p4
    1.2  0.3  1.4  0.0  0.0  0.0  0.0  0.0   0   0 c4t2d0
    0.0  0.0  0.6  0.0  0.0  0.0  0.0  0.4   0   0 c4t2d0s2
    0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.4   0   0 c4t2d0s8
    0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.2   0   0 c4t2d0p0
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0   0 c4t2d0p1
    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   0   0 c4t2d0p2
 
 then issued
 
 # zpool add tank0 cache /dev/dsk/c4t1d0p1 /dev/dsk/c4t2d0p1
 # zpool add tank0 log mirror /dev/dsk/c4t1d0p0 /dev/dsk/c4t2d0p0
 
 which resulted in 
 
 root@oizfs:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Wed Jan  2 21:09:23 2013
 config:
 
NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  c4t3d0s0  ONLINE   0 0 0
 
 errors: No known data errors
 
  pool: tank0
 state: ONLINE
  scan: scrub repaired 0 in 5h17m with 0 errors on Wed Jan  2 17:53:20 2013
 config:
 
NAME   STATE READ WRITE CKSUM
tank0  ONLINE   0 0 0
  raidz3-0 ONLINE   0 0 0
c3t5000C500098BE9DDd0  ONLINE   0 0 0
c3t5000C50009C72C48d0  ONLINE   0 0 0
c3t5000C50009C73968d0  ONLINE   0 0 0
c3t5000C5000FD2E794d0  ONLINE   0 0 0
c3t5000C5000FD37075d0  ONLINE   0 0 0
c3t5000C5000FD39D53d0  ONLINE   0 0 0
c3t5000C5000FD3BC10d0  ONLINE   0 0 0
c3t5000C5000FD3E8A7d0  ONLINE   0 0 0
logs
  mirror-1 ONLINE   0 0 0
c4t1d0p0   ONLINE   0 0 0
c4t2d0p0   ONLINE   0 0 0
cache
  c4t1d0p1 ONLINE   0 0 0
  c4t2d0p1 ONLINE   0 0 0
 
 errors: No known data errors
 
 which resulted in bonnie++
 befo':
 
 NAME   SIZE   Bonnie  Date(y.m.d)  File    Seq-Wr-Chr %CPU  Seq-Write %CPU  Seq-Rewr %CPU  Seq-Rd-Chr %CPU  Seq-Read %CPU  Rnd Seeks %CPU  Files  Seq-Create  Rnd-Create
 rpool  59.5G  start   2012.12.28   15576M  24 MB/s    61    47 MB/s   18    40 MB/s  19    26 MB/s    98    273 MB/s 48    2657.2/s  25    16     12984/s     12058/s
 tank0  7.25T  start   2012.12.29   15576M  35 MB/s    86    145 MB/s  48    109 MB/s 50    25 MB/s    97    291 MB/s 53    819.9/s   12    16     12634/s     9194/s
 
 aftuh:
 
 NAME   SIZE   Bonnie  Date(y.m.d)  File    Seq-Wr-Chr %CPU  Seq-Write %CPU  Seq-Rewr %CPU  Seq-Rd-Chr %CPU  Seq-Read %CPU  Rnd Seeks %CPU  Files  Seq-Create  Rnd-Create
 rpool  59.5G  start   2012.12.28   15576M  24 MB/s    61    47 MB/s   18    40 MB/s  19    26 MB/s    98    273 MB/s 48    2657.2/s  25    16     12984/s     12058/s
 tank0  7.25T  start   2013.01.03   15576M  35 MB/s    86    149 MB/s  48    111 MB/s 50    26 MB/s    98    404 MB/s 76    1094.3/s  12    16     12601/s     9937/s
 
 Does the layout make sense? Do the stats make sense, or is there still 
 something very wrong
 with that pool?
 
 Thanks. 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] poor CIFS and NFS performance

2013-01-02 Thread Richard Elling

On Jan 2, 2013, at 2:03 AM, Eugen Leitl eu...@leitl.org wrote:

 On Sun, Dec 30, 2012 at 10:40:39AM -0800, Richard Elling wrote:
 On Dec 30, 2012, at 9:02 AM, Eugen Leitl eu...@leitl.org wrote:
 
 The system is a MSI E350DM-E33 with 8 GByte PC1333 DDR3
 memory, no ECC. All the systems have Intel NICs with mtu 9000
 enabled, including all switches in the path.
 
 Does it work faster with the default MTU?
 
 No, it was even slower, that's why I went from 1500 to 9000.
 I estimate it brought ~20 MByte/s more peak on Windows 7 64 bit CIFS.

OK, then you have something else very wrong in your network.

 Also check for retrans and errors, using the usual network performance
 debugging checks.
 
 Wireshark or tcpdump on Linux/Windows? What would
 you suggest for OI?

Look at all of the stats for all NICs and switches on both ends of each wire.
Look for collisions (should be 0), drops (should be 0), dups (should be 0),
retrans (should be near 0), flow control (server shouldn't see flow control
activity), etc. There is considerable written material on how to diagnose
network flakiness.
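
On OpenIndiana the stock tools cover most of this; a rough sketch (the interface name e1000g0 is only an example):

# per-link packet and error counters
dladm show-link -s
# TCP retransmission counters
netstat -s -P tcp | grep -i retrans
# packet capture; snoop capture files open directly in Wireshark
snoop -d e1000g0 -o /tmp/capture.snoop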

 
 P.S. Not sure whether this is pathological, but the system
 does produce occasional soft errors like e.g. dmesg
 
 More likely these are due to SMART commands not being properly handled
 for SATA devices. They are harmless.
 
 Otherwise napp-it attests full SMART support.


Yep, this is a SATA/SAS/SMART interaction where assumptions are made
that might not be true. Usually it means that the SMART probes are using SCSI
commands on SATA disks.
 -- richard
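
The per-device error counters make these easy to keep an eye on; a minimal check (no particular device assumed):

# soft/hard/transport error totals, plus device identity
iostat -En
# the same counters as raw kstats
kstat -pm sderr | grep -i "soft errors"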



Re: [zfs-discuss] poor CIFS and NFS performance

2012-12-30 Thread Richard Elling
On Dec 30, 2012, at 9:02 AM, Eugen Leitl eu...@leitl.org wrote:

 
 Happy $holidays,
 
 I have a pool of 8x ST31000340AS on an LSI 8-port adapter as
 a raidz3 (no compression nor dedup) with reasonable bonnie++ 
 1.03 values, e.g.  145 MByte/s Seq-Write @ 48% CPU and 291 MByte/s 
 Seq-Read @ 53% CPU. It scrubs with 230+ MByte/s with reasonable
 system load. No hybrid pools yet. This is latest beta napp-it 
 on OpenIndiana 151a5 server, living on a dedicated 64 GByte SSD.
 
 The system is a MSI E350DM-E33 with 8 GByte PC1333 DDR3
 memory, no ECC. All the systems have Intel NICs with mtu 9000
 enabled, including all switches in the path.

Does it work faster with the default MTU?
Also check for retrans and errors, using the usual network performance
debugging checks.

 
 My problem is pretty poor network throughput. An NFS
 mount on 12.04 64 bit Ubuntu (mtu 9000) or CIFS are
 read at about 23 MBytes/s. Windows 7 64 bit (also jumbo
 frames) reads at about 65 MBytes/s. The highest transfer
 speed on Windows just touches 90 MByte/s, before falling
 back to the usual 60-70 MBytes/s.
 
 I kinda can live with above values, but I have a feeling
 the setup should be able to saturate GBit Ethernet with
 large file transfers, especially on Linux (20 MByte/s
 is nothing to write home about).
 
 Does anyone have any suggestions on how to debug/optimize
 throughput?
 
 Thanks, and happy 2013.
 
 P.S. Not sure whether this is pathological, but the system
 does produce occasional soft errors like e.g. dmesg

More likely these are due to SMART commands not being properly handled
for SATA devices. They are harmless.
 -- richard

 
 Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0
  Error Block: 0
 Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA   
  Serial Number:  
 Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
 Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor 
 unique code 0x0), ASCQ: 0x1d, FRU: 0x0
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.warning] WARNING: 
 /scsi_vhci/disk@g5000c50009c72c48 (sd9):
 Dec 30 17:45:01 oizfs   Error for Command: undecoded cmd 0xa1    Error Level: Recovered
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0
  Error Block: 0
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA   
  Serial Number:  
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor 
 unique code 0x0), ASCQ: 0x1d, FRU: 0x0
 Dec 30 17:45:01 oizfs pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) 
 instance 0 irq 0xe vector 0x45 ioapic 0x3 intin 0xe is bound to cpu 0
 Dec 30 17:45:01 oizfs pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) 
 instance 0 irq 0xe vector 0x45 ioapic 0x3 intin 0xe is bound to cpu 1
 Dec 30 17:45:01 oizfs pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) 
 instance 0 irq 0xe vector 0x45 ioapic 0x3 intin 0xe is bound to cpu 0
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.warning] WARNING: 
 /scsi_vhci/disk@g5000c50009c73968 (sd4):
 Dec 30 17:45:01 oizfs   Error for Command: undecoded cmd 0xa1    Error Level: Recovered
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0
  Error Block: 0
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA   
  Serial Number:  
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
 Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor 
 unique code 0x0), ASCQ: 0x1d, FRU: 0x0
 Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.warning] WARNING: 
 /scsi_vhci/disk@g5000c500098be9dd (sd10):
 Dec 30 17:45:03 oizfs   Error for Command: undecoded cmd 0xa1    Error Level: Recovered
 Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0
  Error Block: 0
 Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA   
  Serial Number:  
 Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
 Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor 
 unique code 0x0), ASCQ: 0x1d, FRU: 0x0
 Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.warning] WARNING: 
 /pci@0,0/pci1462,7720@11/disk@3,0 (sd8):
 Dec 30 17:45:04 oizfs   Error for Command: undecoded cmd 0xa1    Error Level: Recovered
 Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0
  Error Block: 0
 Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA   
  Serial Number:  
 Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
 Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice]  

Re: [zfs-discuss] ZFS QoS and priorities

2012-12-06 Thread Richard Elling
On Dec 6, 2012, at 5:30 AM, Matt Van Mater matt.vanma...@gmail.com wrote:

 
 
 I'm unclear on the best way to warm data... do you mean to simply `dd 
 if=/volumes/myvol/data of=/dev/null`?  I have always been under the 
 impression that ARC/L2ARC has rate limiting how much data can be added to the 
 cache per interval (i can't remember the interval).  Is this not the case?  
 If there is some rate limiting in place, dd-ing the data like my example 
 above would not necessarily cache all of the data... it might take several 
 iterations to populate the cache, correct?  
 
 Quick update... I found at least one reference to the rate limiting I was 
 referring to.  It was Richard from ~2.5 years ago :)
 http://marc.info/?l=zfs-discussm=127060523611023w=2
 
 I assume the source code reference is still valid, in which case a population 
 of 8MB per 1 second into L2ARC is extremely slow in my books and very 
 conservative... It would take a very long time to warm the hundreds of gigs 
 of VMs we have into cache.  Perhaps the L2ARC_WRITE_BOOST tunable might be a 
 good place to aggressively warm a cache, but my preference is to not touch 
 the tunables if I have a choice.  I'd rather the system default be updated to 
 reflect modern hardware, that way everyone benefits and I'm not running some 
 custom build.

Yep, the default L2ARC fill rate is quite low for modern systems. It is not 
uncommon
to see it increased significantly, with the corresponding improvements in hit 
rate for
busy systems. Can you file an RFE at 
https://www.illumos.org/projects/illumos-gate/issues/
Thanks!
 -- richard
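
For anyone who wants to experiment in the meantime, the fill rate is governed by the l2arc_write_max and l2arc_write_boost tunables (bytes per fill interval); a sketch for illumos, where the exact defaults and behaviour vary by build:

# persistent, in /etc/system: 64 MB steady, 128 MB while the ARC is still cold
set zfs:l2arc_write_max = 0x4000000
set zfs:l2arc_write_boost = 0x8000000
# or live with mdb, effective immediately and lost at reboot
echo "l2arc_write_max/Z 0x4000000" | mdb -kw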



Re: [zfs-discuss] ZFS QoS and priorities

2012-12-05 Thread Richard Elling
On Dec 5, 2012, at 5:41 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2012-12-05 04:11, Richard Elling wrote:
 On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru
 mailto:jimkli...@cos.ru wrote:
 
 I've heard a claim that ZFS relies too much on RAM caching, but
 implements no sort of priorities (indeed, I've seen no knobs to
 tune those) - so that if the storage box receives many different
 types of IO requests with different administrative weights in
 the view of admins, it can not really throttle some IOs to boost
 others, when such IOs have to hit the pool's spindles.
 
 Caching has nothing to do with QoS in this context. *All* modern
 filesystems cache to RAM, otherwise they are unusable.
 
 Yes, I get that. However, many systems get away with less RAM
 than recommended for ZFS rigs (like the ZFS SA with a couple
 hundred GB as the starting option), and make their compromises
 elsewhere. They have to anyway, and they get different results,
 perhaps even better suited to certain narrow or big niches.

This is nothing more than a specious argument. They have small 
caches, so their performance is not as good as those with larger 
caches. This is like saying you need a smaller CPU cache because
larger CPU caches get full.

 Whatever the aggregate result, this difference does lead to
 some differing features that The Others' marketing trumpets
 praise as the advantage :) - like this ability to mark some
 IO traffic as of higher priority than other traffics, in one
 case (which is now also an Oracle product line, apparently)...
 
 Actually, this question stems from a discussion at a seminar
 I've recently attended - which praised ZFS but pointed out its
 weaknesses against some other players on the market, so we are
 not unaware of those.
 
 For example, I might want to have corporate webshop-related
 databases and appservers to be the fastest storage citizens,
 then some corporate CRM and email, then various lower priority
 zones and VMs, and at the bottom of the list - backups.
 
 Please read the papers on the ARC and how it deals with MFU and
 MRU cache types. You can adjust these policies using the primarycache
 and secondarycache properties at the dataset level.
 
 I've read on that, and don't exactly see how much these help
 if there is pressure on RAM so that cache entries expire...
 Meaning, if I want certain datasets to remain cached as long
 as possible (i.e. serve website or DB from RAM, not HDD), at
 expense of other datasets that might see higher usage, but
 have lower business priority - how do I do that? Or, perhaps,
 add (L2)ARC shares, reservations and/or quotas concepts to the
 certain datasets which I explicitly want to throttle up or down?

MRU evictions take precedence over MFU evictions. If the data is 
not in MFU, then it is, by definition, not being frequently used.

 At most, now I can mark the lower-priority datasets' data or
 even metadata as not cached in ARC or L2ARC. On-off. There seems
 to be no smaller steps, like in QoS tags [0-7] or something like
 that.
 
 BTW, as a short side question: is it a true or false statement,
 that: if I set primarycache=metadata, then ZFS ARC won't cache
 any userdata and thus it won't appear in (expire into) L2ARC?
 So the real setting is that I can cache data+meta in RAM, and
 only meta in SSD? Not the other way around (meta in RAM but
 both data+meta in SSD)?

That is correct, by my reading of the code.
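
In practice the knobs look like this (dataset names are only examples):

# hot dataset: cache data and metadata in ARC and let it spill to L2ARC
zfs set primarycache=all tank/db
zfs set secondarycache=all tank/db
# low-priority dataset: metadata only in ARC, nothing on the cache device
zfs set primarycache=metadata tank/backup
zfs set secondarycache=none tank/backup
zfs get primarycache,secondarycache tank/backup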

 
 AFAIK, now such requests would hit the ARC, then the disks if
 needed - in no particular order. Well, can the order be made
 particular with current ZFS architecture, i.e. by setting
 some datasets to have a certain NICEness or another priority
 mechanism?
 
 ZFS has a priority-based I/O scheduler that works at the DMU level.
 However, there is no system call interface in UNIX that transfers
priority or QoS information (e.g. read() or write()) into the file system VFS
interface. So the granularity of priority control is by zone or dataset.
 
 I do not think I've seen mention of priority controls per dataset,
 at least not in generic ZFS. Actually, that was part of my question
 above. And while throttling or resource shares between higher level
 software components (zones, VMs) might have similar effect, this is
 not something really controlled and enforced by the storage layer.

The priority scheduler is by type of I/O request. For example, sync 
requests have priority over async requests. Reads and writes have
priority over scrubbing etc. The inter-dataset scheduling is done at
the zone level.

There is more work being done in this area, but it is still in the research
phase.
 -- richard



Re: [zfs-discuss] ZFS QoS and priorities

2012-12-05 Thread Richard Elling
On Dec 5, 2012, at 7:46 AM, Matt Van Mater matt.vanma...@gmail.com wrote:

 I don't have anything significant to add to this conversation, but wanted to 
 chime in that I also find the concept of a QOS-like capability very appealing 
 and that Jim's recent emails resonate with me.  You're not alone!  I believe 
 there are many use cases where a granular prioritization that controls how 
 ARC, L2ARC, ZIL and underlying vdevs are used to give priority IO to a 
 specific zvol, share, etc would be useful.  My experience is stronger in the 
 networking side and I envision a weighted class based queuing methodology (or 
 something along those lines).  I recognize that ZFS's architecture preference 
 for coalescing writes and reads into larger sequential batches might conflict 
 with a QOS-like capability... Perhaps the ARC/L2ARC tuning might be a good 
 starting point towards that end?

At present, I do not see async write QoS as being interesting. That leaves sync 
writes and reads
as the managed I/O. Unfortunately, with HDDs, the variance in response time exceeds the
queue management
time, so the results are less useful than the case with SSDs. Control theory 
works, once again.
For sync writes, they are often latency-sensitive and thus have the highest 
priority. Reads have
lower priority, with prefetch reads at lower priority still.

 
 On a related note (maybe?) I would love to see pool-wide settings that 
 control how aggressively data is added/removed form ARC, L2ARC, etc.

Evictions are done on an as-needed basis. Why would you want to evict more than 
needed?
So you could fetch it again?

Prefetching can be more aggressive, but we actually see busy systems disabling 
prefetch to 
improve interactive performance. Queuing theory works, once again.

  Something that would accelerate the warming of a cold pool of storage or be 
 more aggressive in adding/removing cached data on a volatile dataset (e.g. 
 where Virtual Machines are turned on/off frequently).  I have heard that some 
 of these defaults might be changed in some future release of Illumos, but 
 haven't seen any specifics saying that the idea is nearing fruition in 
 release XYZ.

It is easy to warm data (dd), even to put it into MRU (dd + dd). For best 
performance with
VMs, MRU works extremely well, especially with clones.

There are plenty of good ideas being kicked around here, but remember that to 
support
things like QoS at the application level, the applications must be written to 
an interface
that passes QoS hints all the way down the stack. Lacking these interfaces, 
means that
QoS needs to be managed by hand... and that management effort must be worth the 
effort.
 -- richard

 
 Matt
 
 
 On Wed, Dec 5, 2012 at 10:26 AM, Jim Klimov jimkli...@cos.ru wrote:
 On 2012-11-29 10:56, Jim Klimov wrote:
 For example, I might want to have corporate webshop-related
 databases and appservers to be the fastest storage citizens,
 then some corporate CRM and email, then various lower priority
 zones and VMs, and at the bottom of the list - backups.
 
 On a side note, I'm now revisiting old ZFS presentations collected
 over the years, and one suggested as TBD statements the ideas
 that metaslabs with varying speeds could be used for specific
 tasks, and not only to receive the allocations first so that a new
 pool would perform quickly. I.e. TBD: Workload specific freespace
 selection policies.
 
 Say, I create a new storage box and lay out some bulk file, backup
 and database datasets. Even as they are receiving their first bytes,
 I have some idea about the kind of performance I'd expect from them -
 with QoS per dataset I might destine the databases to the fast LBAs
 (and smaller seeks between tracks I expect to use frequently), and
 the bulk data onto slower tracks right from the start, and the rest
 of unspecified data would grow around the middle of the allocation
 range.
 
 These types of data would then only creep onto the less fitting
 metaslabs (faster for bulk, slower for DB) if the target ones run
 out of free space. Then the next-best-fitting would be used...
 
 This one idea is somewhat reminiscent of hierarchical storage
 management, except that it is about static allocation at the
 write-time and takes place within the single disk (or set of
 similar disks), in order to warrant different performance for
 different tasks.
 
 ///Jim
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] ZFS QoS and priorities

2012-12-05 Thread Richard Elling
bug fix below...

On Dec 5, 2012, at 1:10 PM, Richard Elling richard.ell...@gmail.com wrote:

 On Dec 5, 2012, at 7:46 AM, Matt Van Mater matt.vanma...@gmail.com wrote:
 
 I don't have anything significant to add to this conversation, but wanted to 
 chime in that I also find the concept of a QOS-like capability very 
 appealing and that Jim's recent emails resonate with me.  You're not alone!  
 I believe there are many use cases where a granular prioritization that 
 controls how ARC, L2ARC, ZIL and underlying vdevs are used to give priority 
 IO to a specific zvol, share, etc would be useful.  My experience is 
 stronger in the networking side and I envision a weighted class based 
 queuing methodology (or something along those lines).  I recognize that 
 ZFS's architecture preference for coalescing writes and reads into larger 
 sequential batches might conflict with a QOS-like capability... Perhaps the 
 ARC/L2ARC tuning might be a good starting point towards that end?
 
 At present, I do not see async write QoS as being interesting. That leaves 
 sync writes and reads
 as the managed I/O. Unfortunately, with HDDs, the variance in response time 
  queue management
 time, so the results are less useful than the case with SSDs. Control theory 
 works, once again.
 For sync writes, they are often latency-sensitive and thus have the highest 
 priority. Reads have
 lower priority, with prefetch reads at lower priority still.
 
 
 On a related note (maybe?) I would love to see pool-wide settings that 
 control how aggressively data is added/removed form ARC, L2ARC, etc.
 
 Evictions are done on an as-needed basis. Why would you want to evict more 
 than needed?
 So you could fetch it again?
 
 Prefetching can be more aggressive, but we actually see busy systems 
 disabling prefetch to 
 improve interactive performance. Queuing theory works, once again.
 
  Something that would accelerate the warming of a cold pool of storage or be 
 more aggressive in adding/removing cached data on a volatile dataset (e.g. 
 where Virtual Machines are turned on/off frequently).  I have heard that 
 some of these defaults might be changed in some future release of Illumos, 
 but haven't seen any specifics saying that the idea is nearing fruition in 
 release XYZ.
 
 It is easy to warm data (dd), even to put it into MRU (dd + dd). For best 
 performance with
 VMs, MRU works extremely well, especially with clones.

Should read:
It is easy to warm data (dd), even to put it into MFU (dd + dd). For best 
performance with
VMs, MFU works extremely well, especially with clones.
 -- richard
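
In other words, simply reading the data twice is enough; a sketch, with a made-up path for a VM image:

# the first pass lands the blocks in MRU, the second promotes them to MFU
dd if=/tank/vm/base-image.img of=/dev/null bs=1024k
dd if=/tank/vm/base-image.img of=/dev/null bs=1024k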

 
 There are plenty of good ideas being kicked around here, but remember that to 
 support
 things like QoS at the application level, the applications must be written to 
 an interface
 that passes QoS hints all the way down the stack. Lacking these interfaces, 
 means that
 QoS needs to be managed by hand... and that management effort must be worth 
 the effort.
  -- richard
 
 
 Matt
 
 
 On Wed, Dec 5, 2012 at 10:26 AM, Jim Klimov jimkli...@cos.ru wrote:
 On 2012-11-29 10:56, Jim Klimov wrote:
 For example, I might want to have corporate webshop-related
 databases and appservers to be the fastest storage citizens,
 then some corporate CRM and email, then various lower priority
 zones and VMs, and at the bottom of the list - backups.
 
 On a side note, I'm now revisiting old ZFS presentations collected
 over the years, and one suggested as TBD statements the ideas
 that metaslabs with varying speeds could be used for specific
 tasks, and not only to receive the allocations first so that a new
 pool would perform quickly. I.e. TBD: Workload specific freespace
 selection policies.
 
 Say, I create a new storage box and lay out some bulk file, backup
 and database datasets. Even as they are receiving their first bytes,
 I have some idea about the kind of performance I'd expect from them -
 with QoS per dataset I might destine the databases to the fast LBAs
 (and smaller seeks between tracks I expect to use frequently), and
 the bulk data onto slower tracks right from the start, and the rest
 of unspecified data would grow around the middle of the allocation
 range.
 
 These types of data would then only creep onto the less fitting
 metaslabs (faster for bulk, slower for DB) if the target ones run
 out of free space. Then the next-best-fitting would be used...
 
 This one idea is somewhat reminiscent of hierarchical storage
 management, except that it is about static allocation at the
 write-time and takes place within the single disk (or set of
 similar disks), in order to warrant different performance for
 different tasks.
 
 ///Jim
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo

Re: [zfs-discuss] ZFS QoS and priorities

2012-12-04 Thread Richard Elling
On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru wrote:

 I've heard a claim that ZFS relies too much on RAM caching, but
 implements no sort of priorities (indeed, I've seen no knobs to
 tune those) - so that if the storage box receives many different
 types of IO requests with different administrative weights in
 the view of admins, it can not really throttle some IOs to boost
 others, when such IOs have to hit the pool's spindles.

Caching has nothing to do with QoS in this context. *All* modern
filesystems cache to RAM, otherwise they are unusable.

 
 For example, I might want to have corporate webshop-related
 databases and appservers to be the fastest storage citizens,
 then some corporate CRM and email, then various lower priority
 zones and VMs, and at the bottom of the list - backups.

Please read the papers on the ARC and how it deals with MFU and
MRU cache types. You can adjust these policies using the primarycache
and secondarycache properties at the dataset level.

 
 AFAIK, now such requests would hit the ARC, then the disks if
 needed - in no particular order. Well, can the order be made
 particular with current ZFS architecture, i.e. by setting
 some datasets to have a certain NICEness or another priority
 mechanism?

ZFS has a priority-based I/O scheduler that works at the DMU level.
However, there is no system call interface in UNIX that transfers
priority or QoS information (e.g. read() or write()) into the file system VFS
interface. So the granularity of priority control is by zone or dataset.
 -- richard



Re: [zfs-discuss] ZFS QoS and priorities

2012-12-01 Thread Richard Elling
On Dec 1, 2012, at 6:54 PM, Nikola M. minik...@gmail.com wrote:

 On 12/ 2/12 03:24 AM, Nikola M. wrote:
 It is using Solaris Zones and throttling their disk usage on that level,
 so you separate workload processes on separate zones.
 Or even put KVM machines under the zones (Joyent and OI support 
 Joyent-written KVM/Intel implementation in Illumos)  for the same reason of 
 I/O throttling.
 
 They (Joyent) say that their solution is made in not too much code, but 
 gives very good results (they run massive cloud computing service, with many 
 zones and KVM VM's so they might know).
 http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle
 http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/
 
 There is short video from 16th minute onward, from BayLISA meetup at Joyent, 
 August 16, 2012
 https://www.youtube.com/watch?v=6csFi0D5eGY
 Talking about ZFS Throttle implementation architecture in Illumos , from 
 Joyent's Smartos.

There was a good presentation on this at the OpenStorage Summit in 2011.
Look for it on youtube.

 I learned it is also available in Entic.net-sponsored Openindiana
 and probably in Nexenta, too, since it is implemented inside Illumos.

NexentaStor 3.x is not an illumos-based distribution, it is based on OpenSolaris
b134.

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dm-crypt + ZFS on Linux

2012-11-23 Thread Richard Elling
On Nov 23, 2012, at 11:56 AM, Fabian Keil freebsd-lis...@fabiankeil.de wrote:
 
 Just in case your GNU/Linux experiments don't work out, you could
 also try ZFS on Geli on FreeBSD which works reasonably well.
 

For illumos-based distros or Solaris 11, using ZFS with lofi has been
well discussed for many years. Prior to the crypto option being integrated
as a first class citizen in OpenSolaris, the codename used was xlofi, so
try that in your Google searches, or look at the man page for lofiadm.

 -- richard
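
A rough sketch of the lofi approach on illumos (file name and size are placeholders; see lofiadm(1M) for the exact crypto options on your release):

# create a backing file and attach it as an encrypted lofi device
mkfile 10g /export/vault.img
lofiadm -c aes-256-cbc -a /export/vault.img   # prompts for a passphrase
# build a pool on the resulting block device
zpool create vault /dev/lofi/1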

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hardware Recommendations: SAS2 JBODs

2012-11-13 Thread Richard Elling
On Nov 13, 2012, at 12:08 PM, Peter Tripp pe...@psych.columbia.edu wrote:

 Hi folks,
 
 I'm in the market for a couple of JBODs.  Up until now I've been relatively 
 lucky with finding hardware that plays very nicely with ZFS.  All my gear 
 currently in production uses LSI SAS controllers (3801e, 9200-16e, 9211-8i) 
 with backplanes powered by LSI SAS expanders (Sun x4250, Sun J4400, etc).  
 But I'm in the market for SAS2 JBODs to support a large number of 3.5-inch SAS 
 disks (60+ 3TB disks to start).
 
 I'm aware of potential issues with SATA drives/interposers and the whole SATA 
 Tunneling Protocol (STP) nonsense, so I'm going to stick to a pure SAS setup. 
 Also, since I've had trouble in the past with daisy-chained SAS JBODs 
 I'll probably stick with one SAS 4x cable (SFF8088) per JBOD, and unless there 
 were a compelling reason for multi-pathing I'd probably stick to a single 
 controller.  If possible I'd rather buy 20 packs of enterprise SAS disks with 
 5yr warranties and have the JBOD come with empty trays, but would also 
 consider buying disks with the JBOD if the price wasn't too crazy.
 
 Does anyone have any positive/negative experiences with any of the following 
 with ZFS: 
 * SuperMicro SC826E16-R500LPB (2U 12 drives, dual 500w PS, single LSI SAS2X28 
 expander)
 * SuperMicro SC846BE16-R920B (4U 24 drives, dual 920w PS, single unknown 
 expander)
 * Dell PowerVault MD 1200 (2U 12 drives, dual 600w PS, dual unknown expanders)
 * HP StorageWorks D2600 (2U 12 drives, dual 460w PS, single/dual unknown 
 expanders)

I've used all of the above and all of the DataOn systems, too (Hi Rocky!) 
No real complaints, though as others have noted the supermicro gear
tends to require more work to get going.
 -- richard

 I'm leaning towards the SuperMicro stuff, but every time I order SuperMicro 
 gear there's always something missing or wrongly configured so some of the 
 cost savings gets eaten up with my time figuring out where things went wrong 
 and returning/ordering replacements.  The Dell/HP gear I'm sure is fine, but 
 buying disks from them gets pricey quick. The last time I looked they charged 
 $150 extra per disk when the only added value was a proprietary sled and a 
 shorter warranty (3yr vs 5yr).
 
 I'm open to other JBOD vendors too, was just really just curious what folks 
 were using when they needed more than two dozen 3.5 SAS disks for use with 
 ZFS.
 
 Thanks
 -Peter
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] ARC de-allocation with large ram

2012-10-22 Thread Richard Elling
On Oct 22, 2012, at 6:52 AM, Chris Nagele nag...@wildbit.com wrote:

 If after it decreases in size it stays there it might be similar to:
 
7111576 arc shrinks in the absence of memory pressure
 
 After it dropped, it did build back up. Today is the first day that
 these servers are working under real production load and it is looking
 much better. arcstat is showing some nice numbers for arc, but l2 is
 still building.
 
 read   hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size
 19K    17K   2.5K  87    2.5K    490     2.0K    19      148G   371G
 41K    39K   2.3K  94    2.3K    184     2.1K    7       148G   371G
 34K    34K   694   98    694     17      677     2       148G   371G
 16K    15K   1.0K  93    1.0K    16      1.0K    1       148G   371G
 39K    36K   2.3K  94    2.3K    20      2.3K    0       148G   371G
 23K    22K   746   96    746     76      670     10      148G   371G
 49K    47K   1.7K  96    1.7K    249     1.5K    14      148G   371G
 23K    21K   1.4K  93    1.4K    38      1.4K    2       148G   371G
 
 My only guess is that the large zfs send / recv streams were affecting
 the cache when they started and finished.

There are other cases where data is evicted from the ARC, though I don't
have a complete list at my fingertips. For example, if a zvol is closed, then
the data for the zvol is evicted.
 -- richard

 
 Thanks for the responses and help.
 
 Chris
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] zfs send to older version

2012-10-22 Thread Richard Elling
On Oct 19, 2012, at 4:59 PM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Richard Elling
 
 At some point, people will bitterly regret some zpool upgrade with no way
 back.
 
 uhm... and how is that different than anything else in the software world?
 
 No attempt at backward compatibility, and no downgrade path, not even by 
 going back to an older snapshot before the upgrade.

ZFS has a stellar record of backwards compatibility. The only break with 
backwards
compatibility I can recall was a bug fix in the send stream somewhere around 
opensolaris b34.

Perhaps you are confusing backwards compatibility with forwards compatibility?
 -- richard



Re: [zfs-discuss] zfs send to older version

2012-10-19 Thread Richard Elling
On Oct 19, 2012, at 1:04 AM, Michel Jansens michel.jans...@ulb.ac.be wrote:

 On 10/18/12 21:09, Michel Jansens wrote:
 Hi,
 
 I've been using a Solaris 10 update 9 machine for some time to replicate 
 filesystems from different servers through zfs send|ssh zfs receive.
 This was done to store  disaster recovery pools. The DR zpools are made 
 from  sparse files (to allow for easy/efficient backup to tape).
 
 Now I've installed a Solaris 11 machine and a SmartOS one.
 When I try to replicate the pools from those machines, I get an error 
 because filesystem/pool version don't support some features/properties on 
 the solaris 10u9.
 Is there a way (apart from rsync) to send a snapshot from a newer zpool to 
 an older one?
 
 You have to create pools/filesystems with the older versions used by the 
 destination machine.
 
 Thanks Ian,
 
 One thing that is annoying though with running old pool version on Solaris is 
 that zpool status -x doesn't return 'all pools are healthy'.
 And I wonder how SmartOS or Solaris 11 will react with Solaris 10 update 9 
 version filesystem for zones or KVM...
 Also hearing about the new feature flags, I have a feeling that there is a 
 risk of ZFS world being more and more fragmented.

Feature flags offers a sane method to deal with the existing fragmentation.
Everyone will have it, except Oracle Solaris.

 At some point, people will bitterly regret some zpool upgrade with no way 
 back.

uhm... and how is that different than anything else in the software world?

 In that fragmented world, some common exchange (replication) format would be 
 reassuring.
 
 In this respect, I suppose Arne Jansen's zfs fits-send portable streams is 
 good news, though it's write only (to BTRFS), And it looks like a filesystem 
 only feature  (not for volumes) 

FITS is interesting for those file systems that support snapshots. If the market
demands, there could be some interesting work done for interop with ReFS
and others.
 -- richard
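
Until then, Ian's suggestion above is the practical workaround: create the filesystems you replicate at versions the older host understands. A sketch (version numbers are examples only; check what the old host actually reports):

# on the old destination host, see the highest supported versions
zpool upgrade -v
zfs upgrade -v
# on the newer source host, create the dataset to be replicated at that version
zfs create -o version=4 tank/dr
# streams from it can then be received by the older host
zfs send tank/dr@snap | ssh s10host zfs receive drpool/dr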



Re: [zfs-discuss] looking for slides for basic zfs intro

2012-10-19 Thread Richard Elling
On Oct 19, 2012, at 6:37 AM, Eugen Leitl eu...@leitl.org wrote:

 Hi,
 
 I would like to give a short talk at my organisation in order
 to sell them on zfs in general, and on zfs-all-in-one and
 zfs as remote backup (zfs send).

Googling will find a few shorter presos. I have full-day presos on
slideshare
http://www.slideshare.net/relling

source available on request.
 -- richard

 
 Does anyone have a short set of presentation slides or maybe 
 a short video I could pillage for that purpose? Thanks.
 
 -- Eugen
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] Changing rpool device paths/drivers

2012-10-19 Thread Richard Elling
On Oct 19, 2012, at 12:16 AM, James C. McPherson j...@opensolaris.org wrote:

 On 19/10/12 04:50 PM, Jim Klimov wrote:
 Hello all,
 
 I have one more thought - or a question - about the current
 strangeness of rpool import: is it supported, or does it work,
 to have rpools on multipathed devices?
 
 If yes (which I hope it is, but don't have a means to check)
 what sort of a string is saved into the pool's labels as its
 device path? Some metadevice which is on a layer above mpxio,
 or one of the physical storage device paths? If the latter is
 the case, what happens during system boot if the multipathing
 happens to choose another path, not the one saved in labels?
 
 if you run /usr/bin/strings over /etc/zfs/zpool.cache,
 you'll see that not only is the device path stored, but
 (more importantly) the devid.

yuk. zdb -C is what you want.

 As far as I'm aware, having an rpool on multipathed devices
 is fine. Multiple paths to the device should still allow ZFS
 to obtain the same devid info... and we use devid's in
 preference to physical paths.

It is fine. The boot process is slightly different in that zpool.cache
is not consulted at first. However, it is consulted later, so there are
edge cases where this can cause problems when there are significant
changes in the device tree. The archives are full of workarounds for 
this rare case.
 -- richard



Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question

2012-10-12 Thread Richard Elling
On Oct 12, 2012, at 5:50 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
 Pedantically, a pool can be made in a file, so it works the same...
 
 Pool can only be made in a file, by a system that is able to create a pool.  

You can't send a pool, you can only send a dataset. Whether you receive the 
dataset
into a pool or file is a minor nit, the send stream itself is consistent.
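
For what it's worth, a pool in a file is only a couple of commands on any host that has ZFS (size and paths are just examples):

# a pool living in an ordinary file
mkfile 64g /backup/tank-dr.img
zpool create tankdr /backup/tank-dr.img
# received datasets are then browsable like any other filesystem
zfs send -R tank@snap | zfs receive -Fdu tankdr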

 Point is, his receiving system runs linux and doesn't have any zfs; his 
 receiving system is remote from his sending system, and it has been suggested 
 that he might consider making an iscsi target available, so the sending 
 system could zpool create and zfs receive directly into a file or device 
 on the receiving system, but it doesn't seem as if that's going to be 
 possible for him - he's expecting to transport the data over ssh.  So he's 
 looking for a way to do a zfs receive on a linux system, transported over 
 ssh.  Suggested answers so far include building a VM on the receiving side, 
 to run openindiana (or whatever) or using zfs-fuse-linux. 
 
 He is currently writing his zfs send datastream into a series of files on 
 the receiving system, but this has a few disadvantages as compared to doing 
 zfs receive on the receiving side.  Namely, increased risk of data loss and 
 less granularity for restores.  For these reasons, it's been suggested to 
 find a way of receiving via zfs receive and he's exploring the 
 possibilities of how to improve upon this situation.  Namely, how to zfs 
 receive on a remote linux system via ssh, instead of cat'ing or redirecting 
 into a series of files.
 
 There, I think I've recapped the whole thread now.   ;-)


Yep, and cat works fine.
 -- richard



Re: [zfs-discuss] horrible slow pool

2012-10-11 Thread Richard Elling
Hi John,
comment below...

On Oct 11, 2012, at 3:10 AM, Carsten John cj...@mpi-bremen.de wrote:

 Hello everybody,
 
 I just wanted to share my experience with a (partially) broken SSD that was 
 in use in a ZIL mirror.
 
 We experienced a dramatic performance problem with one of our zpools, serving 
 home directories. Mainly NFS clients were affected. Our SunRay infrastructure 
 came to a complete halt.
 
 Finally we were able to identify one SSD as the root cause. The SSD was still 
 working, but quite slow.
 
 The issue didn't trigger ZFS to detect the disk as faulty. FMA didn't detect 
 it, too.
 
 We identified the broken disk by issuing 'iostat -en'. After replacing the 
 SSD, everything went back to normal.
 
 To prevent outages like this in the future I hacked together a quick and 
 dirty bash script to detect disks with a given rate of total errors. The 
 script might be used in conjunction with nagios.

This shouldn't be needed. All of the fields of iostat are in kstats and nagios 
can already
collect kstats.
kstat -pm sderr

The good thing about using this method is that it works with or without ZFS.
The bad thing is that some SMART tools and devices trigger complaints that
show up as errors (that can be safely ignored).
 -- richard
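
The kstat output is already name/value pairs, so the monitoring side reduces to a grep; for example (instance names will differ):

# every error counter the sd driver keeps
kstat -pm sderr
# just the soft/hard/transport error totals
kstat -pm sderr | egrep "Soft Errors|Hard Errors|Transport Errors"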

 
 Perhaps it's of use for others sa well:
 
 ###
 #!/bin/bash
 # Check the disks in all pools for errors.
 # Partially failing (or slow) disks may result in horribly
 # degraded performance of a zpool despite the fact that
 # the pool is still healthy.
 
 # exit codes
 # 0 OK
 # 1 WARNING
 # 2 CRITICAL
 # 3 UNKNOWN
 
 OUTPUT=""
 WARNING=0
 CRITICAL=0
 SOFTLIMIT=5
 HARDLIMIT=20
 
 LIST=$(zpool status | grep "c[1-9].*d0" | awk '{print $1}')
 
 for DISK in $LIST
 do
     # 4th comma-separated field of "iostat -enr" is the total error count
     ERROR=$(iostat -enr "$DISK" | cut -d, -f4 | grep "^[0-9]")
     if [[ $ERROR -gt $HARDLIMIT ]]
     then
         OUTPUT="$OUTPUT, $DISK:$ERROR"
         CRITICAL=1
     elif [[ $ERROR -gt $SOFTLIMIT ]]
     then
         OUTPUT="$OUTPUT, $DISK:$ERROR"
         WARNING=1
     fi
 done
 
 if [[ $CRITICAL -gt 0 ]]
 then
     echo "CRITICAL: disks with error count above $HARDLIMIT found:$OUTPUT"
     exit 2
 fi
 if [[ $WARNING -gt 0 ]]
 then
     echo "WARNING: disks with error count above $SOFTLIMIT found:$OUTPUT"
     exit 1
 fi
 
 echo "OK: no significant disk errors found"
 exit 0
 
 ###
 
 
 
 cu
 
 Carsten
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question

2012-10-11 Thread Richard Elling
On Oct 11, 2012, at 6:03 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
 Read it again: he asked, On that note, is there a minimal user-mode zfs thing
 that would allow
 receiving a stream into an image file?  Something like:
  zfs send ... | ssh user@host cat > file
 
 He didn't say he wanted to cat to a file.  But it doesn't matter.  It was 
 only clear from context, responding to the advice of zfs receiveing into a 
 zpool-in-a-file, that it was clear he was asking about doing a zfs receive 
 into a file, not just cat.  If you weren't paying close attention to the 
 thread, it would be easy to misunderstand what he was asking for.

Pedantically, a pool can be made in a file, so it works the same...

 
 When he asked for minimal user-mode he meant, something less than a 
 full-blown OS installation just for the purpose of zfs receive.  He went on 
 to say, he was considering zfs-fuse-on-linux.

... though I'm not convinced zfs-fuse supports files, whereas illumos/Solaris 
does.
Perhaps a linux fuse person can respond.
 -- richard



Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-11 Thread Richard Elling
On Oct 11, 2012, at 2:58 PM, Phillip Wagstrom phillip.wagst...@gmail.com 
wrote:

 
 On Oct 11, 2012, at 4:47 PM, andy thomas wrote:
 
 According to a Sun document called something like 'ZFS best practice' I read 
 some time ago, best practice was to use the entire disk for ZFS and not to 
 partition or slice it in any way. Does this advice hold good for FreeBSD as 
 well?
 
   My understanding of the best practice was that with Solaris prior to 
 ZFS, it disabled the volatile disk cache.  

This is not quite correct. If you use the whole disk ZFS will attempt to enable 
the 
write cache. To understand why, remember that UFS (and ext, by default) can die 
a
horrible death (+fsck) if there is a power outage and cached data is not 
flushed to disk.
So by default, Sun shipped some disks with write cache disabled by default. For 
non-Sun
disks, they are most often shipped with write cache enabled and the most 
popular file
systems (NTFS) properly issue cache flush requests as needed (for the same 
reason ZFS
issues cache flush requests).

 With ZFS, the disk cache is used, but after every transaction a cache-flush 
 command is issued to ensure that the data made it the platters.

Write cache is flushed after uberblock updates and for ZIL writes. This is 
important for
uberblock updates, so the uberblock doesn't point to a garbaged MOS. It is 
important
for ZIL writes, because they must be guaranteed written to media before ack.
 -- richard
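
The difference shows up at pool creation time; a minimal illustration (device names are placeholders):

# whole disks: ZFS labels them with EFI and will try to enable the write cache
zpool create tank mirror c0t0d0 c0t1d0
# slices: ZFS leaves the cache setting alone, since others may share the disk
zpool create tank mirror c0t0d0s0 c0t1d0s0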

  If you slice the disk, enabling the disk cache for the whole disk is 
 dangerous because other file systems (meaning UFS) wouldn't do the 
 cache-flush and there was a risk for data-loss should the cache fail due to, 
 say a power outage.
   Can't speak to how BSD deals with the disk cache.
 
 I looked at a server earlier this week that was running FreeBSD 8.0 and had 
 2 x 1 Tb SAS disks in a ZFS 13 mirror with a third identical disk as a 
 spare. Large file I/O throughput was OK but the mail jail it hosted had 
 periods when it was very slow with accessing lots of small files. All three 
 disks (the two in the ZFS mirror plus the spare) had been partitioned with 
 gpart so that partition 1 was a 6 GB swap and partition 2 filled the rest of 
 the disk and had a 'freebsd-zfs' partition on it. It was these second 
 partitions that were part of the mirror.
 
 This doesn't sound like a very good idea to me, as surely disk seeks for swap 
 and for ZFS file I/O are bound to clash, aren't they?
 
   It surely would make a slow, memory starved swapping system even 
 slower.  :)
 
 Another point about the Sun ZFS paper - it mentioned optimum performance 
 would be obtained with RAIDz pools if the number of disks was between 3 and 
 9. So I've always limited my pools to a maximum of 9 active disks plus 
 spares but the other day someone here was talking of seeing hundreds of 
 disks in a single pool! So what is the current advice for ZFS in Solaris and 
 FreeBSD?
 
   That number was drives per vdev, not per pool.
 
 -Phil
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question

2012-10-10 Thread Richard Elling

On Oct 10, 2012, at 9:29 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Richard Elling
 
 If the recipient system doesn't support zfs receive, [...]
 
 On that note, is there a minimal user-mode zfs thing that would allow
 receiving a stream into an image file? No need for file/directory access
 etc.
 
 cat :-)
 
 He was asking if it's possible to do zfs receive on a system that doesn't 
 natively support zfs.  The answer is no, unless you want to consider fuse or 
 similar.

Read it again: he asked, On that note, is there a minimal user-mode zfs thing 
that would allow 
receiving a stream into an image file?  Something like:
zfs send ... | ssh user@host cat > file

  I can't speak about zfs on fuse or anything - except that I personally 
 wouldn't trust it.  There are differences even between zfs on solaris versus 
 freebsd, vs whatever, all of which are fully supported, much better than zfs 
 on fuse.  But different people use and swear by all of these things - so 
 maybe it would actually be a good solution for you.
 
 The direction I would personally go would be an openindiana virtual machine 
 to do the zfs receive.
 
 
 I was thinking maybe the zfs-fuse-on-linux project may have suitable bits?
 
 I'm sure most Linux distros have cat
 
 hehe.  Anyway.  Answered above.
 


 -- richard



Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question

2012-10-07 Thread Richard Elling
On Oct 7, 2012, at 3:50 PM, Johannes Totz johan...@jo-t.de wrote:

 On 05/10/2012 15:01, Edward Ned Harvey
 (opensolarisisdeadlongliveopensolaris) wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- 
 boun...@opensolaris.org] On Behalf Of Tiernan OToole
 
 I am in the process of planning a system which will have 2 ZFS 
 servers, one on site, one off site. The on site server will be
 used by workstations and servers in house, and most of that will
 stay in house. There will, however, be data i want backed up
 somewhere else, which is where the offsite server comes in... This
 server will be sitting in a Data Center and will have some storage 
 available to it (the whole server currently has 2 3Tb drives, 
 though they are not dedicated to the ZFS box, they are on VMware 
 ESXi). There is then some storage (currently 100Gb, but more can
 be requested) of SFTP enabled backup which i plan to use for some 
 snapshots, but more on that later.
 
 Anyway, i want to confirm my plan and make sure i am not missing 
 anything here...
 
 * build server in house with storage, pools, etc... * have a
 server in data center with enough storage for its reason, plus the
 extra for offsite backup * have one pool set as my offsite
 pool... anything in here should be backed up off site also... *
 possibly have another set as very offsite which will also be
 pushed to the SFTP server, but not sure... * give these pools out
 via SMB/NFS/iSCSI * every 6 or so hours take a snapshot of the 2 
 offsite pools. * do a ZFS send to the data center box * nightly,
 on the very offsite pool, do a ZFS send to the SFTP server * if 
 anything goes wrong (my server dies, DC server dies, etc), Panic, 
 download, pray... the usual... :)
 
 Anyway, I want to make sure i am doing this correctly... Is there 
 anything on that list that sounds stupid or am i doing anything 
 wrong? am i missing anything?
 
 Also, as a follow up question, but slightly unrelated, when it 
 comes to the ZFS Send, i could use SSH to do the send, directly to 
 the machine... Or i could upload the compressed, and possibly 
 encrypted dump to the server... Which, for resume-ability and 
 speed, would be suggested? And if i where to go with an upload 
 option, any suggestions on what i should use?
 
 It is recommended, whenever possible, you should pipe the zfs send 
 directly into a zfs receive on the receiving system.  For two
 solid reasons:
 
 If a single bit is corrupted, the whole stream checksum is wrong and 
 therefore the whole stream is rejected.  So if this occurs, you want 
 to detect it (in the form of one incremental failed) and then
 correct it (in the form of the next incremental succeeding).
 Whereas, if you store your streams on storage, it will go undetected,
 and everything after that point will be broken.
 
 If you need to do a restore, from a stream stored on storage, then 
 your only choice is to restore the whole stream.  You cannot look 
 inside and just get one file.  But if you had been doing send | 
 receive, then you obviously can look inside the receiving filesystem 
 and extract some individual specifics.
 
 If the recipient system doesn't support zfs receive, [...]
 
 On that note, is there a minimal user-mode zfs thing that would allow
 receiving a stream into an image file? No need for file/directory access
 etc.

cat :-)

 I was thinking maybe the zfs-fuse-on-linux project may have suitable bits?

I'm sure most Linux distros have cat
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How many disk in one pool

2012-10-05 Thread Richard Elling
On Oct 5, 2012, at 1:57 PM, Albert Shih albert.s...@obspm.fr wrote:

 Hi all,
 
 I'm actually running ZFS under FreeBSD. I've a question about how many
 disks I «can» have in one pool. 
 
 At this moment I'm running one server (FreeBSD 9.0) with 4 MD1200s
 (Dell), meaning 48 disks. I've configured 4 raidz2 vdevs in the pool (one on
 each MD1200).
 
 From what I understand I can add more MD1200s. But if I lose one MD1200
 for any reason I lose the entire pool.
 
 In your experience, what's the «limit»? 100 disks?

I can't speak for current FreeBSD, but I've seen more than 400
disks (HDDs) in a single pool.

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] removing upgrade notice from 'zpool status -x'

2012-10-04 Thread Richard Elling
On Oct 4, 2012, at 8:58 AM, Jan Owoc jso...@gmail.com wrote:

 Hi,
 
 I have a machine whose zpools are at version 28, and I would like to
 keep them at that version for portability between OSes. I understand
 that 'zpool status' asks me to upgrade, but so does 'zpool status -x'
 (the man page says it should only report errors or unavailability).
 This is a problem because I have a script that assumes zpool status
 -x only returns errors requiring user intervention.

The return code for zpool is ambiguous. Do not rely upon it to determine
if the pool is healthy. You should check the health property instead.
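
For scripting that looks something like this (pool name is an example):

# one word per pool: ONLINE, DEGRADED, FAULTED, ...
zpool list -H -o name,health
# or for a single pool
zpool list -H -o health tank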

 Is there a way to either:
 A) suppress the upgrade notice from 'zpool status -x' ?

Pedantic answer, it is open source ;-)

 B) use a different command to get information about actual errors
 w/out encountering the upgrade notice ?
 
 I'm using OpenIndiana 151a6 on x86.


 -- richard



Re: [zfs-discuss] vm server storage mirror

2012-10-04 Thread Richard Elling
On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber dswa...@druber.com wrote:

 On 10/4/2012 11:48 AM, Richard Elling wrote:
 
 On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber dswa...@druber.com wrote:
 
 
 This whole thread has been fascinating.  I really wish we (OI) had the two 
 following things that freebsd supports:
 
 1. HAST - provides a block-level driver that mirrors a local disk to a 
 network disk presenting the result as a block device using the GEOM API.
 
 This is called AVS in the Solaris world.
 
 In general, these systems suffer from a fatal design flaw: the authoritative 
 view of the 
 data is not also responsible for the replication. In other words, you can 
 provide coherency
 but not consistency. Both are required to provide a single view of the data.
 
 Can you expand on this?

I could, but I've already written a book on clustering. For a more general 
approach
to understanding clustering, I can highly recommend Pfister's In Search of 
Clusters.
http://www.amazon.com/In-Search-Clusters-2nd-Edition/dp/0138997098

NB, clustered storage is the same problem as clustered compute wrt state.

 2. CARP.
 
 This exists as part of the OHAC project.
  -- richard
 
 
 These are both freely available?

Yes.
 -- richard



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Richard Elling
Thanks Neil, we always appreciate your comments on ZIL implementation.
One additional comment below...

On Oct 4, 2012, at 8:31 AM, Neil Perrin neil.per...@oracle.com wrote:

 On 10/04/12 05:30, Schweiss, Chip wrote:
 
 Thanks for all the input.  It seems information on the performance of the 
 ZIL is sparse and scattered.   I've spent significant time researching this 
 the past day.  I'll summarize what I've found.   Please correct me if I'm 
 wrong.
 The ZIL can have any number of SSDs attached either mirror or individually.  
  ZFS will stripe across these in a raid0 or raid10 fashion depending on how 
 you configure.
 
 The ZIL code chains blocks together and these are allocated round robin among 
 slogs or
 if they don't exist then the main pool devices.
 
 To determine the true maximum streaming performance of the ZIL setting 
 sync=disabled will only use the in RAM ZIL.   This gives up power protection 
 to synchronous writes.
 
 There is no RAM ZIL. If sync=disabled then all writes are asynchronous and 
 are written
 as part of the periodic ZFS transaction group (txg) commit that occurs every 
 5 seconds.
 
 Many SSDs do not help protect against power failure because they have their 
 own ram cache for writes.  This effectively makes the SSD useless for this 
 purpose and potentially introduces a false sense of security.  (These SSDs 
 are fine for L2ARC)
 
 The ZIL code issues a write cache flush to all devices it has written before 
 returning
 from the system call. I've heard, that not all devices obey the flush but we 
 consider them
 as broken hardware. I don't have a list to avoid.
 
 
 Mirroring SSDs is only helpful if one SSD fails at the time of a power
 failure.  This leaves several unanswered questions.  How good is ZFS at
 detecting that an SSD is no longer a reliable write target?  The chance of
 silent data corruption is well documented for spinning disks.  What chance
 of data corruption does this introduce with up to 10 seconds of data written
 on the SSD?  Does ZFS read the ZIL during a scrub to determine if the SSD is
 returning what we write to it?
 
 If the ZIL code gets a block write failure it will force the txg to commit 
 before returning.
 It will depend on the drivers and IO subsystem as to how hard it tries to 
 write the block.
 
 
 Zpool versions 19 and higher should be able to survive a ZIL failure, losing
 only the uncommitted data.  However, I haven't seen good enough
 information that I would necessarily trust this yet.
 
 This has been available for quite a while and I haven't heard of any bugs in 
 this area.
 
 Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs.   
 I'm not sure if that is current, but I can't find any reports of better 
 performance.   I would suspect that DDR drive or Zeus RAM as ZIL would push 
 past this.
 
 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device.
For example, if you have a device that can achieve 700 MB/sec, but a workload
generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it
should be immediately obvious that the slog needs to be striped. Empirically,
this is also easy to measure.
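
For illustration, striping the slog is just a matter of adding the log devices
unmirrored (device names hypothetical):

  # zpool add tank log c4t0d0 c4t1d0 c4t2d0    # three striped slog devices
  # zpool add tank log mirror c4t0d0 c4t1d0    # versus a single mirrored slog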
 -- richard

 
   
 Anyone care to post their performance numbers on current hardware with E5 
 processors, and ram based ZIL solutions?  
 
 Thanks to everyone who has responded and contacted me directly on this issue.
 
 -Chip
 On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel 
 andrew.gabr...@cucumber.demon.co.uk wrote:
 Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL 
 to
 make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.
 
 Temporarily set sync=disabled
 Or, depending on your application, leave it that way permanently.  I know, 
 for the work I do, most systems I support at most locations have 
 sync=disabled.  It all depends on the workload.
 
 Noting of course that this means that in the case of an unexpected system 
 outage or loss of connectivity to the disks, synchronous writes since the 
 last txg commit will be lost, even though the applications will believe they 
 are secured to disk. (ZFS filesystem won't be corrupted, but it will look 
 like it's been wound back by up to 30 seconds when you reboot.)
 
 This is fine for some workloads, such as those where you would start again 
 with fresh data and those which can look closely at the data to see how far 
 they got before being rudely interrupted, but not for those which rely on 
 the Posix semantics of synchronous writes/syncs meaning data is secured on 
 non-volatile storage when the function returns.
 
 -- 
 Andrew
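
For reference, the sync property discussed above is per-dataset and can be flipped on the fly; a minimal sketch, dataset name hypothetical:

  # zfs set sync=disabled tank/scratch     # sync writes handled as async; data-loss window on crash
  # zfs get sync tank/scratch
  # zfs set sync=standard tank/scratch     # restore POSIX-compliant sync behaviour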
 
 
 
 

Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Richard Elling

On Oct 4, 2012, at 1:33 PM, Schweiss, Chip c...@innovates.com wrote:

 Again thanks for the input and clarifications.
 
 I would like to clarify the ZIL performance numbers I was seeing discussed on
 other forums.  Right now I'm getting streaming performance of sync writes at
 about 1 Gbit/s.  My target is closer to 10 Gbit/s.  If I get to build this
 system, it will house a decent-sized VMware NFS storage with 200+ VMs, which
 will be dual connected via 10GbE.
 This is all medical imaging research.  We move data around by the TB and fast 
 streaming is imperative.  
 
 The system I've been testing with is 10GbE connected and I have about 50
 VMs running very happily, and I haven't yet found my random I/O limit.  However,
 every time I storage vMotion a handful of additional VMs, the ZIL seems to
 max out its write speed to the SSDs and random I/O also suffers.  Without
 the SSD ZIL, random I/O is very poor.  I will be doing some testing with
 sync=disabled tomorrow and see how things perform.
 
 If anyone can testify to a ZIL device(s) that can keep up with 10GBe or more 
 streaming synchronous writes please let me know.  

Quick datapoint, with qty 3 ZeusRAMs as striped slog, we could push 1.3 
GBytes/sec of 
storage vmotion on a relatively modest system. To sustain that sort of thing 
often requires
full system-level tuning and proper systems engineering design. Fortunately, 
people 
tend to not do storage vmotion on a continuous basis.
 -- richard



[zfs-discuss] reminder: ZFS day next Tuesday

2012-09-27 Thread Richard Elling
If you've been hiding under a rock, not checking your email, then you might
not have heard about the Next Big Whopper Event for ZFS Fans: ZFS Day!
The agenda is now set and the teams are preparing to descend towards San 
Francisco's Moscone Center vortex for a full day of ZFS. I'd love to see y'all 
there in person, but if you can't make it, be sure to register for the streaming
video feeds. Details at:
www.zfsday.com

Be sure to prep your ZFS war stories for the beer bash afterwards -- thanks
Delphix!
 -- richard



Re: [zfs-discuss] vm server storage mirror

2012-09-26 Thread Richard Elling
On Sep 26, 2012, at 10:54 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 Here's another one.
  
 Two identical servers are sitting side by side.  They could be connected to 
 each other via anything (presently using crossover ethernet cable.)  And 
 obviously they both connect to the regular LAN.  You want to serve VM's from 
 at least one of them, and even if the VM's aren't fault tolerant, you want at 
 least the storage to be live synced.  The first obvious thing to do is simply 
 cron a zfs send | zfs receive at a very frequent interval.  But there are a 
 lot of downsides to that - besides the fact that you have to settle for some 
 granularity, you also have a script on one system that will clobber the other 
 system. So in the event of a failure, you might promote the backup into 
 production, and you have to be careful not to let it get clobbered when the 
 main server comes up again.
  
 I like much better, the idea of using a zfs mirror between the two systems.  
 Even if it comes with a performance penalty, as a result of bottlenecking the 
 storage onto Ethernet.  But there are several ways to possibly do that, and 
 I'm wondering which will be best.
  
 Option 1:  Each system creates a big zpool of the local storage.  Then, 
 create a zvol within the zpool, and export it iscsi to the other system.  Now 
 both systems can see a local zvol, and a remote zvol, which it can use to 
 create a zpool mirror.  The reasons I don't like this idea are because it's a 
 zpool within a zpool, including the double-checksumming and everything.  But 
 the double-checksumming isn't such a concern to me - I'm mostly afraid some 
 horrible performance or reliability problem might be resultant.  Naturally, 
 you would only zpool import the nested zpool on one system.  The other system 
 would basically just ignore it.  But in the event of a primary failure, you 
 could force import the nested zpool on the secondary system.

This was described by Thorsten a few years ago.
http://www.osdevcon.org/2009/slides/high_availability_with_minimal_cluster_torsten_frueauf.pdf

IMHO, the issues are operational: troubleshooting could be very challenging.

  
 Option 2:  At present, both systems are using local mirroring, 3 mirror pairs 
 of 6 disks.  I could break these mirrors, and export one side over to the 
 other system...  And vice versa.  So neither server will be doing local 
 mirroring; they will both be mirroring across iscsi to targets on the other 
 host.  Once again, each zpool will only be imported on one host, but in the 
 event of a failure, you could force import it on the other host.
  
 Can anybody think of a reason why Option 2 would be stupid, or can you think 
 of a better solution?

If they are close enough for crossover cable where the cable is UTP, then 
they are 
close enough for SAS.
 -- richard



Re: [zfs-discuss] Interesting question about L2ARC

2012-09-26 Thread Richard Elling

On Sep 26, 2012, at 4:28 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:

 On 09/26/2012 01:14 PM, Edward Ned Harvey
 (opensolarisisdeadlongliveopensolaris) wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 Got me wondering: how many reads of a block from spinning rust
 suffice for it to ultimately get into L2ARC? Just one so it
 gets into a recent-read list of the ARC and then expires into
 L2ARC when ARC RAM is more needed for something else, 
 
 Correct, but not always sufficient.  I forget the name of the parameter, but 
 there's some rate limiting thing that limits how fast you can fill the 
 L2ARC.  This means sometimes, things will expire from ARC, and simply get 
 discarded.
 
 The parameters are:
 
 *) l2arc_write_max (default 8MB): max number of bytes written per
fill cycle

It should be noted that this level was perhaps appropriate 6 years
ago, when L2ARC was integrated and given the SSDs available at the
time, but is well below reasonable settings for high speed systems or
modern SSDs. It is probably not a bad idea to change the default to 
reflect more modern systems, thus avoiding surprises.
 -- richard

 *) l2arc_headroom (default 2x): multiplies the above parameter and
determines how far into the ARC lists we will search for buffers
eligible for writing to L2ARC.
 *) l2arc_feed_secs (default 1s): regular interval between fill cycles
 *) l2arc_feed_min_ms (default 200ms): minimum interval between fill
cycles
 
 Cheers,
 --
 Saso
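
For reference, the tunables listed above can be raised on illumos-based systems, either live with mdb or persistently in /etc/system; a hedged sketch, values purely illustrative:

  # echo l2arc_write_max/Z 0x4000000 | mdb -kw                  # 64 MiB per fill cycle, effective immediately
  # echo 'set zfs:l2arc_write_max = 0x4000000' >> /etc/system   # persists across reboots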



Re: [zfs-discuss] Cold failover of COMSTAR iSCSI targets on shared storage

2012-09-25 Thread Richard Elling
On Sep 25, 2012, at 12:30 PM, Jim Klimov jimkli...@cos.ru wrote:

 Hello all,
 
  With original old ZFS iSCSI implementation there was
 a shareiscsi property for the zvols to be shared out,
 and I believe all configuration pertinent to the iSCSI
 server was stored in the pool options (I may be wrong,
 but I'd expect that given that ZFS-attribute-based
 configs were designed to atomically import and share
 pools over various protocols like CIFS and NFS).
 
  With COMSTAR which is more advanced and performant,
 all configs seem to be in the OS config files and/or
 SMF service properties - not in the pool in question.
 
  Does this mean that importing a pool with iSCSI zvols
 on a fresh host (LiveCD instance on the same box, or
 via failover of shared storage to a different host)
 will not be able to automagically share the iSCSI
 targets the same way as they were known in the initial
 OS that created and shared them - not until an admin
 defines the same LUNs and WWN numbers and such, manually?
 
  Is this a correct understanding (and does the problem
 exist indeed), or do I (hopefully) miss something?

That is pretty much how it works, with one small wrinkle -- the
configuration is stored in SMF. So you can either do it the hard
way (by hand), use a commercially-available HA solution
(eg. RSF-1 from high-availability.com), or use SMF export/import.
 -- richard



Re: [zfs-discuss] ZFS stats output - used, compressed, deduped, etc.

2012-09-25 Thread Richard Elling
On Sep 25, 2012, at 11:17 AM, Jason Usher jushe...@yahoo.com wrote:
 
 Ok - but from a performance point of view, I am only using
 ram/cpu resources for the deduping of just the individual
 filesystems I enabled dedupe on, right ?  I hope that
 turning on dedupe for just one filesystem did not incur
 ram/cpu costs across the entire pool...
 
 It depends. -- richard
 
 
 
 
 Can you elaborate at all ?  Dedupe can have fairly profound performance 
 implications, and I'd like to know if I am paying a huge price just to get a 
 dedupe on one little filesystem ...

The short answer is: deduplication transforms big I/Os into small I/Os, 
but does not eliminate I/O. The reason is that the deduplication table has
to be updated when you write something that is deduplicated. This implies
that storage devices which are inexpensive in $/GB but expensive in $/IOPS
might not be the best candidates for deduplication (eg. HDDs). There is some
additional CPU overhead for the sha-256 hash that might or might not be 
noticeable, depending on your CPU. But perhaps the most important factor
is your data -- is it dedupable and are the space savings worthwhile? There
is no simple answer for that, but we generally recommend that you simulate
dedup before committing to it.
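
For example, zdb can simulate deduplication on an existing pool and print the projected DDT histogram and dedup ratio without changing anything (pool name hypothetical):

  # zdb -S tank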
 -- richard



Re: [zfs-discuss] Cold failover of COMSTAR iSCSI targets on shared storage

2012-09-25 Thread Richard Elling
On Sep 25, 2012, at 1:32 PM, Jim Klimov jimkli...@cos.ru wrote:

 2012-09-26 0:21, Richard Elling пишет:
 Does this mean that importing a pool with iSCSI zvols
 on a fresh host (LiveCD instance on the same box, or
 via failover of shared storage to a different host)
 will not be able to automagically share the iSCSI
 targets the same way as they were known in the initial
 OS that created and shared them - not until an admin
 defines the same LUNs and WWN numbers and such, manually?
 
 Is this a correct understanding (and does the problem
 exist indeed), or do I (hopefully) miss something?
 
 That is pretty much how it works, with one small wrinkle -- the
 configuration is stored in SMF. So you can either do it the hard
 way (by hand), use a commercially-available HA solution
 (eg. RSF-1 from high-availability.com http://high-availability.com),
 or use SMF export/import.
  -- richard
 
 So if I wanted to make a solution where upon import of
 the pool with COMSTAR shared zvols, the new host is able
 to publish the same resources as the previous holder of
 the pool media, could I get away with some scripts (on
 all COMSTAR servers involved) which would:
 
 1) Regularly svccfg export certain SMF service configs
   to a filesystem dataset on the pool in question.

This is only needed when you add a new COMSTAR share.
You will also need to remove old ones. Fortunately, you have a 
pool where you can store these :-)

 2) Upon import of the pool, such scripts would svccfg
   import the SMF setup, svcadm refresh and maybe
   svcadm restart (or svcadm enable) the iSCSI SMF
   services and thus share the same zvols with same
   settings?

Import should suffice.

 Is this a correct understanding of doing shareiscsi
 for COMSTAR in the poor-man's HA setup? ;)

Yes.
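
A hedged sketch of that poor-man's flow (paths are hypothetical, FMRIs abbreviated, and untested, so verify against your own bits):

  # on the host currently serving the targets, after any COMSTAR change:
  svccfg export -a stmf > /tank/ha/stmf.xml
  svccfg export -a network/iscsi/target > /tank/ha/iscsi-target.xml

  # on the host that just imported the pool:
  svccfg import /tank/ha/stmf.xml
  svccfg import /tank/ha/iscsi-target.xml
  svcadm restart svc:/system/stmf:default
  svcadm restart svc:/network/iscsi/target:default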

 Apparently, to be transparent for clients, this would
 also use VRRP or something like that to carry over the
 iSCSI targets' IP address(es), separate from general
 communications addressing of the hosts (the addressing
 info might also be in same dataset as SMF exports).

Or just add another IP address. This is how HA systems work.

 Q: Which services are the complete list needed to
   set up the COMSTAR server from scratch?

Dunno off the top of my head. Network isn't needed (COMSTAR
can serve FC), but you can look at the SMF configs for details.

I haven't looked at the OHAC agents in a long, long time, but you
might find some scripts already built there.
 -- richard



Re: [zfs-discuss] ZFS stats output - used, compressed, deduped, etc.

2012-09-25 Thread Richard Elling

On Sep 25, 2012, at 1:46 PM, Jim Klimov jimkli...@cos.ru wrote:

 2012-09-24 21:08, Jason Usher wrote:
 Ok, thank you.  The problem with this is, the
 compressratio only goes to two significant digits, which
 means if I do the math, I'm only getting an
 approximation.  Since we may use these numbers to
 compute billing, it is important to get it right.
 
 Is there any way at all to get the real *exact* number ?
 
 Well, if you take into account snapshots and clones,
 you can see really small used numbers on datasets
 which reference a lot of data.
 
 In fact, for accounting you might be better off with
 the referenced field instead of used, but note
 that it is not recursive and you need to account
 each child dataset's byte references separately.
 
 I am not sure if there is a simple way to get exact
 byte-counts instead of roundings like 422M...

zfs get -p
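
For instance, with a hypothetical dataset name:

  # zfs get -p -H -o value used,referenced tank/home

returns exact byte counts rather than the rounded human-readable values.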
 -- richard



Re: [zfs-discuss] ZFS stats output - used, compressed, deduped, etc.

2012-09-24 Thread Richard Elling
On Sep 24, 2012, at 10:08 AM, Jason Usher jushe...@yahoo.com wrote:

 Oh, and one other thing ...
 
 
 --- On Fri, 9/21/12, Jason Usher jushe...@yahoo.com wrote:
 
 It shows the allocated number of bytes used by the
 filesystem, i.e.
 after compression. To get the uncompressed size,
 multiply
 used by
 compressratio (so for example if used=65G and
 compressratio=2.00x,
 then your decompressed size is 2.00 x 65G = 130G).
 
 
 Ok, thank you.  The problem with this is, the
 compressratio only goes to two significant digits, which
 means if I do the math, I'm only getting an
 approximation.  Since we may use these numbers to
 compute billing, it is important to get it right.
 
 Is there any way at all to get the real *exact* number ?
 
 
 I'm hoping the answer is yes - I've been looking but do not see it ...

none can hide from dtrace!
# dtrace -qn 'dsl_dataset_stats:entry {this->ds = (dsl_dataset_t *)arg0;
    printf("%s\tcompressed size = %d\tuncompressed size=%d\n",
    this->ds->ds_dir->dd_myname, this->ds->ds_phys->ds_compressed_bytes,
    this->ds->ds_phys->ds_uncompressed_bytes)}'
openindiana-1   compressed size = 3667988992    uncompressed size=3759321088

[zfs get all rpool/openindiana-1 in another shell]

For reporting, the number is rounded to 2 decimal places.

 Ok.  So the dedupratio I see for the entire pool is
 dedupe ratio for filesystems in this pool that have dedupe
 enabled ... yes ?
 
 
 Also, why do I not see any dedupe stats for the
 individual filesystem ?  I see compressratio, and I
 see
 dedup=on, but I don't see any dedupratio for the
 filesystem
 itself...
 
 
 Ok, getting back to precise accounting ... if I turn on
 dedupe for a particular filesystem, and then I multiply the
 used property by the compressratio property, and calculate
 the real usage, do I need to do another calculation to
 account for the deduplication ?  Or does the used
 property not take into account deduping ?
 
 
 So if the answer to this is yes, the used property is not only a compressed 
 figure, but a deduped figure then I think we have a bigger problem ...
 
 You described dedupe as operating not only within the filesystem with 
 dedup=on, but between all filesystems with dedupe enabled.
 
 Doesn't that mean that if I enabled dedupe on more than one filesystem, I can 
 never know how much total, raw space each of those is using ?  Because if the 
 dedupe ratio is calculated across all of them, it's not the actual ratio for 
 any one of them ... so even if I do the math, I can't decide what the total 
 raw usage for one of them is ... right ?

Correct. This is by design so that blocks shared amongst different datasets can
be deduped -- the common case for things like virtual machine images.

 
 Again, if used does not reflect dedupe, and I don't need to do any math to 
 get the raw storage figure, then it doesn't matter...
 
 
 
 Did turning on dedupe for a single filesystem turn
 it
 on for the entire pool ?
 
 In a sense, yes. The dedup machinery is pool-wide, but
 only
 writes from
 filesystems which have dedup enabled enter it. The
 rest
 simply pass it
 by and work as usual.
 
 
 Ok - but from a performance point of view, I am only using
 ram/cpu resources for the deduping of just the individual
 filesystems I enabled dedupe on, right ?  I hope that
 turning on dedupe for just one filesystem did not incur
 ram/cpu costs across the entire pool...
 
 
 I also wonder about this performance question...

It depends.
 -- richard



Re: [zfs-discuss] Question about ZFS snapshots

2012-09-21 Thread Richard Elling
On Sep 20, 2012, at 10:05 PM, Stefan Ring stefan...@gmail.com wrote:

 On Fri, Sep 21, 2012 at 6:31 AM, andy thomas a...@time-domain.co.uk wrote:
 I have a ZFS filesystem and create weekly snapshots over a period of 5 weeks
 called week01, week02, week03, week04 and week05 respectively. My question
 is: how do the snapshots relate to each other - does week03 contain the
 changes made since week02 or does it contain all the changes made since the
 first snapshot, week01, and therefore includes those in week02?
 
 Every snapshot is based on the previous one and stores only what is
 needed to capture the differences.

This is not correct. Every snapshot is a complete point-in-time view of the 
dataset. 

You can send differences between snapshots that can be received, thus satisfying
a requirement for incremental replication. Internally, this is easy to do 
because the
birth order (in time) of a block is recorded in the metadata.

 To rollback to week03, it's necessary to delete snapshots week04 and week05
 first but what if week01 and week02 have also been deleted - will the
 rollback still work or is it necessary to keep earlier snapshots?
 
 No, it's not necessary. You can rollback to any snapshot.
 
 I almost never use rollback though, in normal use. If I've
 accidentally deleted or overwritten something, I just rsync it over
 from the corresponding /.zfs/snapshots directory. Only if what I want
 to restore is huge, rollback might be a better option.

Yes, rollback is not used very frequently. It is more common to copy out or 
clone the older snapshot. For example, you can clone week03, creating 
what is essentially a fork.
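
For example, a clone of the older snapshot (names hypothetical):

  # zfs clone tank/home@week03 tank/home-week03

gives a writable fork of the week03 state without rolling back tank/home itself.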
 -- richard



Re: [zfs-discuss] Selective zfs list

2012-09-21 Thread Richard Elling
Hi Bogdan,

On Sep 21, 2012, at 4:00 AM, Bogdan Ćulibrk b...@default.rs wrote:

 Greetings,
 
 I'm trying to achieve selective output of zfs list command for specific 
 user to show only delegated sets. Anyone knows how to achieve this?

There are several ways, but no builtin way, today. Can you provide a use case 
for
how you want this to work? We might want to create an RFE here :-)
 -- richard

 I've checked zfs allow already but it only helps in restricting the user to 
 create, destroy, etc something. There is no permission subcommand for listing 
 or displaying sets.
 
 I'm on oi_151a3 bits.
 



Re: [zfs-discuss] all in one server

2012-09-18 Thread Richard Elling
On Sep 18, 2012, at 7:31 AM, Eugen Leitl eu...@leitl.org wrote:
 
 Can I actually have a year's worth of snapshots in
 zfs without too much performance degradation?


I've got 6 years of snapshots with no degradation :-)
In general, there is not a direct correlation between snapshot count and
performance.
 -- richard



Re: [zfs-discuss] Zvol vs zfs send/zfs receive

2012-09-16 Thread Richard Elling
On Sep 15, 2012, at 6:03 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
wrote:

 On Sat, 15 Sep 2012, Dave Pooser wrote:
 
  The problem: so far the send/recv appears to have copied 6.25TB of 
 5.34TB.
 That... doesn't look right. (Comparing zfs list -t snapshot and looking at
 the 5.34 ref for the snapshot vs zfs list on the new system and looking at
 space used.)
 Is this a problem? Should I be panicking yet?
 
 Does the old pool use 512 byte sectors while the new pool uses 4K sectors?  
 Is there any change to compression settings?
 
 With volblocksize of 8k on disks with 4K sectors one might expect very poor 
 space utilization because metadata chunks will use/waste a minimum of 4k.  
 There might be more space consumed by the metadata than the actual data.

With a zvol of 8K blocksize, 4K sector disks, and raidz you will get 12K (data
plus parity) written for every block, regardless of how many disks are in the 
set.
There will also be some metadata overhead, but I don't know of a metadata
sizing formula for the general case.

So the bad news is, 4K sector disks with small blocksize zvols tend to
have space utilization more like mirroring. The good news is that performance
is also more like mirroring.
 -- richard



Re: [zfs-discuss] scripting incremental replication data streams

2012-09-12 Thread Richard Elling
On Sep 12, 2012, at 12:44 PM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 I send a replication data stream from one host to another. (and receive).
 I discovered that after receiving, I need to remove the auto-snapshot 
 property on the receiving side, and set the readonly property on the 
 receiving side, to prevent accidental changes (including auto-snapshots.)
  
 Question #1:  Actually, do I need to remove the auto-snapshot on the 
 receiving side?  

Yes

 Or is it sufficient to simply set the readonly property?  

No

 Will the readonly property prevent auto-snapshots from occurring?

No

  
 So then, sometime later, I want to send an incremental replication stream.  I 
 need to name an incremental source snap on the sending side...  which needs 
 to be the latest matching snap that exists on both sides.
  
 Question #2:  What's the best way to find the latest matching snap on both 
 the source and destination?  At present, it seems, I'll have to build a list 
 of sender snaps, and a list of receiver snaps, and parse and search them, 
 till I find the latest one that exists in both.  For shell scripting, this is 
 very non-trivial.

Actually, it is quite easy. You will notice that zfs list -t snapshot shows 
the list in
creation time order. If you are more paranoid, you can get the snapshot's 
creation time from the creation property. For convenience, zfs get -p 
creation ...
will return the time as a number. Something like this:
for i in $(zfs list -t snapshot -H -o name); do echo $(zfs get -p -H -o value 
creation $i) $i; done | sort -n

 -- richard
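
Extending that one-liner, a hedged sketch for finding the latest snapshot present on both sender and receiver (host and dataset names are hypothetical):

  recv=$(ssh backuphost zfs list -H -o name -t snapshot -r backup/data | sed 's/.*@/@/')
  latest=""
  for s in $(zfs list -H -o name -t snapshot -s creation -r tank/data); do
      snap="@${s#*@}"
      # remember the last (newest) sender snapshot whose name also exists on the receiver
      echo "$recv" | grep -qx "$snap" && latest="$s"
  done
  echo "latest common snapshot: ${latest:-none}"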



Re: [zfs-discuss] ZFS snapshot used space question

2012-08-30 Thread Richard Elling
For illumos-based distributions, there is a written and written@ property 
that shows the 
amount of data writtent to each snapshot. This helps to clear the confusion 
over the way
the used property is accounted.
https://www.illumos.org/issues/1645
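
For example, listing per-snapshot written space for the dataset discussed below (assumes an illumos-based zfs recent enough to have the written property):

  # zfs list -r -t snapshot -o name,used,written rootpool/export/home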

 -- richard

On Aug 29, 2012, at 11:12 AM, Truhn, Chad chad.tr...@bowheadsupport.com 
wrote:

 All,
 
 I apologize in advance for what appears to be a question asked quite often, 
 but I am not sure I have ever seen an answer that explains it.  This may also 
 be a bit long-winded so I apologize for that as well.
 
 I would like to know how much unique space each individual snapshot is using.
 
 I have a ZFS filesystem that shows:
 
 $zfs list -o space rootpool/export/home
  NAME                  AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
  rootpool/export/home  5.81G  14.4G     8.81G   5.54G              0          0
 
 So reading this I see that I have a total of 14.4G of space used by this data 
 set.  Currently 5.54 is active data that is available on the normal 
 filesystem and 8.81G used in snapshots.  8.81G + 5.54G = 14.4G (roughly).   I 
 100% agree with these numbers and the world makes sense.
 
 This is also backed up by:
 
 $zfs get usedbysnapshots rootpool/export/home
  NAME                  PROPERTY         VALUE  SOURCE
  rootpool/export/home  usedbysnapshots  8.81G  -
 
 
 Now if I wanted to see how much space any individual snapshot is currently 
 using I would like to think that this would show me:
 
 $zfs list -ro space rootpool/export/home
 
  NAME                            AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
  rootpool/export/home            5.81G  14.4G     8.81G   5.54G              0          0
  rootpool/export/home@week3          -   202M         -       -              -          -
  rootpool/export/home@week2          -   104M         -       -              -          -
  rootpool/export/home@7daysago       -  1.37M         -       -              -          -
  rootpool/export/home@6daysago       -  1.20M         -       -              -          -
  rootpool/export/home@5daysago       -  1020K         -       -              -          -
  rootpool/export/home@4daysago       -   342K         -       -              -          -
  rootpool/export/home@3daysago       -  1.28M         -       -              -          -
  rootpool/export/home@week1          -      0         -       -              -          -
  rootpool/export/home@2daysago       -      0         -       -              -          -
  rootpool/export/home@yesterday      -   360K         -       -              -          -
  rootpool/export/home@today          -  1.26M         -       -              -          -
 
 
 So normal logic would tell me if USEDSNAP is 8.81G and is composed of 11 
 snapshots, I would add up the size of each of those snapshots and that would 
 roughly equal 8.81G.  So time to break out the calculator:
 
 202M + 104M + 1.37M + 1.20M + 1020K + 342K + 1.28M +0 +0 + 360K + 1.26M
 equals...  ~312M!
 
 That is nowhere near 8.81G.  I would accept it even if it was within 15%, but
 it's not even close.  That's definitely not metadata or ZFS overhead or
 anything.
 
 I understand that snapshots are just the delta between the time when the 
 snapshot was taken and the current active filesystem and are truly just 
 references to a block on disk rather than a copy.  I also understand how 
 two (or more) snapshots can reference the same block on a disk but yet there 
 is still only that one block used.  If I delete a recent snapshot I may not 
 save as much space as advertised because some may be inherited by a parent 
 snapshot.  But that inheritance is not creating duplicate used space on disk 
 so it doesn't justify the huge difference in sizes. 
 
 But even with this logic in place there is currently 8.81G of blocks referred 
 to by snapshots which are not currently on the active filesystem and I 
 don't believe anyone can argue with that.  Can something show me how much 
 space a single snapshot has reserved?
 
 I searched through some of the archives and found this thread 
 (http://mail.opensolaris.org/pipermail/zfs-discuss/2012-August/052163.html) 
 from early this month and I feel as if I have the same problem as the OP, but 
 hopefully attacking it with a little more background.  I am not arguing with 
 discrepancies between df/du and zfs output and I have read the Oracle 
 documentation about it but haven't found what I feel like should be a simple 
 answer.  I currently have a ticket open with Oracle, but I am getting answers 
 to all kinds of questions except for the question I am asking so I am hoping 
 someone out there might be able to help me.
 
 I am a little concerned I am going 

Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Richard Elling

On Aug 24, 2012, at 6:50 AM, Sašo Kiselkov wrote:

 This is something I've been looking into in the code and my take on your
 proposed points this:
 
 1) This requires many and deep changes across much of ZFS's architecture
 (especially the ability to sustain tlvdev failures).
 
 2) Most of this can be achieved (except for cache persistency) by
 implementing ARC space reservations for certain types of data.

I think the simple solution of increasing default metadata limit above 1/4 of
arc_max will take care of the vast majority of small system complaints. The 
limit is arbitrary and set well before dedupe was delivered.
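
For example, on illumos-based systems that limit can be raised through the zfs_arc_meta_limit tunable; a sketch with a purely illustrative value of 8 GiB:

  # echo 'set zfs:zfs_arc_meta_limit = 0x200000000' >> /etc/system   # persistent, takes effect at next boot
  # echo arc_meta_limit/Z 0x200000000 | mdb -kw                      # live change on a running kernel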

 
 The latter has the added benefit of spreading load across all ARC and
 L2ARC resources, so your metaxel device never becomes the sole
 bottleneck and it better embraces the ZFS design philosophy of pooled
 storage.
 
 I plan on having a look at implementing cache management policies (which
 would allow for tuning space reservations for metadata/etc. in a
 fine-grained manner without the cruft of having to worry about physical
 cache devices as well).
 
 Cheers,
 --
 Saso
 
 On 08/24/2012 03:39 PM, Jim Klimov wrote:
 Hello all,
 
  The idea of dedicated metadata devices (likely SSDs) for ZFS
 has been generically discussed a number of times on this list,
 but I don't think I've seen a final proposal that someone would
 take up for implementation (as a public source code, at least).
 
  I'd like to take a liberty of summarizing the ideas I've either
 seen in discussions or proposed myself on this matter, to see if
 the overall idea would make sense to gurus of ZFS architecture.
 
  So, the assumption was that the performance killer in ZFS at
 least on smallish deployments (few HDDs and an SSD accelerator),
 like those in Home-NAS types of boxes, was random IO to lots of
 metadata.

It is a bad idea to make massive investments in development and 
testing because of an assumption. Build test cases, prove that the
benefits of the investment can outweigh other alternatives, and then
deliver code.
 -- richard

 This IMHO includes primarily the block pointer tree
 and the DDT for those who risked using dedup. I am not sure how
 frequent is the required read access to other types of metadata
 (like dataset descriptors, etc.) that the occasional reading and
 caching won't solve.
 
  Another idea was that L2ARC caching might not really cut it
 for metadata in comparison to a dedicated metadata storage,
 partly due to the L2ARC becoming empty upon every export/import
 (boot) and needing to get re-heated.
 
  So, here go the highlights of proposal (up for discussion).
 
 In short, the idea is to use today's format of the blkptr_t
 which by default allows to store up to 3 DVA addresses of the
 block, and many types of metadata use only 2 copies (at least
 by default). This new feature adds a specially processed
 TLVDEV in the common DVA address space of the pool, and
 enforces storage of added third copies for certain types
 of metadata blocks on these devices. (Limited) Backwards
 compatibility is quite possible, on-disk format change may
 be not required. The proposal also addresses some questions
 that arose in previous discussions, especially about proposals
 where SSDs would be the only storage for pool's metadata:
 * What if the dedicated metadata device overflows?
 * What if the dedicated metadata device breaks?
 = okay/expected by design, nothing dies.
 
  In more detail:
 1) Add a special Top-Level VDEV (TLVDEV below) device type (like
   cache and log - say, metaxel for metadata accelerator?),
   and allow (even encourage) use of mirrored devices and allow
   expansion (raid0, raid10 and/or separate TLVDEVs) with added
   singlets/mirrors of such devices.
   Method of device type definition for the pool is discussable,
   I'd go with a special attribute (array) or nvlist in the pool
   descriptor, rather than some special type ID in the ZFS label
   (backwards compatibility, see point 4 for detailed rationale).
 
   Discussable: enable pool-wide or per-dataset (i.e. don't
   waste accelerator space and lifetime for rarely-reused
   datasets like rolling backups)? Choose what to store on
   (particular) metaxels - DDT, BPTree, something else?
   Overall, this availability of choice is similar to choice
   of modes for ARC/L2ARC caching or enabling ZIL per-dataset...
 
 2) These devices should be formally addressable as part of the
   pool in DVA terms (tlvdev:offset:size), but writes onto them
   are artificially limited by ZFS scheduler so as to only allow
   specific types of metadata blocks (blkptr_t's, DDT entries),
   and also enforce writing of added third copies (for blocks
   of metadata with usual copies=2) onto these devices.
 
 3) Absence or FAULTEDness of this device should not be fatal
   to the pool, but it may require manual intervention to force
   the import. Particularly, removal, replacement or resilvering
   onto different storage (i.e. migrating to larger SSDs) should
 

Re: [zfs-discuss] Recovering lost labels on raidz member

2012-08-13 Thread Richard Elling

On Aug 13, 2012, at 2:24 AM, Sašo Kiselkov wrote:

 On 08/13/2012 10:45 AM, Scott wrote:
 Hi Saso,
 
 thanks for your reply.
 
 If all disks are the same, is the root pointer the same?
 
 No.
 
 Also, is there a signature or something unique to the root block that I can
 search for on the disk?  I'm going through the On-disk specification at the
 moment.
 
 Nope. The checksums are part of the blockpointer, and the root
 blockpointer is in the uberblock, which itself resides in the label. By
 overwriting the label you've essentially erased all hope of practically
 finding the root of the filesystem tree - not even checksumming all
 possible block combinations (of which there are quite a few) will help
 you here, because you have no checksums to compare them against.
 
 I'd love to be wrong, and I might be (I don't have as intimate a
 knowledge of ZFS' on-disk structure as I'd like), but from where I'm
 standing, your raidz vdev is essentially lost.

The labels are not identical, because each contains the guid for the device.
It is possible, though nontrivial, to recreate.

That said, I've never seen a failure that just takes out only the ZFS labels.
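
For anyone diagnosing a similar situation, the surviving label copies on a device can be dumped with zdb (device path hypothetical):

  # zdb -l /dev/rdsk/c2t0d0s0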
 -- richard



Re: [zfs-discuss] Recovering lost labels on raidz member

2012-08-13 Thread Richard Elling

On Aug 13, 2012, at 8:59 PM, Scott wrote:

 On Mon, Aug 13, 2012 at 10:40:45AM -0700, Richard Elling wrote:
 
 On Aug 13, 2012, at 2:24 AM, Sa?o Kiselkov wrote:
 
 On 08/13/2012 10:45 AM, Scott wrote:
 Hi Saso,
 
 thanks for your reply.
 
 If all disks are the same, is the root pointer the same?
 
 No.
 
 Also, is there a signature or something unique to the root block that I 
 can
 search for on the disk?  I'm going through the On-disk specification at the
 moment.
 
 Nope. The checksums are part of the blockpointer, and the root
 blockpointer is in the uberblock, which itself resides in the label. By
 overwriting the label you've essentially erased all hope of practically
 finding the root of the filesystem tree - not even checksumming all
 possible block combinations (of which there are quite a few) will help
 you here, because you have no checksums to compare them against.
 
 I'd love to be wrong, and I might be (I don't have as intimate a
 knowledge of ZFS' on-disk structure as I'd like), but from where I'm
 standing, your raidz vdev is essentially lost.
 
 The labels are not identical, because each contains the guid for the device.
 It is possible, though nontrivial, to recreate.
 
 That said, I've never seen a failure that just takes out only the ZFS labels.
 
 You'd have to go out of your way to take out the labels.  Which is just what
 I did (imagine: moving drives over to USB external enclosures, then putting
 them onto a HP Raid controller (which overwrites the end of the disk) - which
 also assumed that two disks should be automatically mirrored (if you miss the
 5 second prompt where you can tell it not to).

ouch. But that shouldn't be enough. 

 Then try and recover the labels without really knowing what you're doing (my 
 bad).

d'oh!

 Suffice to say I have no confidence in the labels of two drives.  On OI I can
 forcefully import the pool but with any file that lives on multiple disks (ie,
 over a certain size), all I get is an I/O error.  Some of datasets also fail
 to mount.

please tell me you imported readonly?
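
For the record, a read-only import looks like this (pool name hypothetical):

  # zpool import -o readonly=on -f tank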
 -- richard

 
 Thanks everyone for your input.
 
 -- richard
 



Re: [zfs-discuss] FreeBSD ZFS

2012-08-09 Thread Richard Elling
On Aug 9, 2012, at 4:11 AM, joerg.schill...@fokus.fraunhofer.de (Joerg 
Schilling) wrote:

 Sa?o Kiselkov skiselkov...@gmail.com wrote:
 
 On 08/09/2012 01:05 PM, Joerg Schilling wrote:
 Sa?o Kiselkov skiselkov...@gmail.com wrote:
 
 To me it seems that the open-sourced ZFS community is not open, or 
 could you 
 point me to their mailing list archives?
 
 Jörg
 
 
 z...@lists.illumos.org
 
 Well, why then has there been a discussion about a closed zfs mailing 
 list?
 Is this no longer true?
 
 Not that I know of. The above one is where I post my changes and Matt,
 George, Garrett and all the others are lurking there.
 
 So if you frequently read this list, can you tell me whether they discuss the 
 on-disk format in this list?

Yes, but nobody has posted proposals for new on-disk format changes
since feature flags was first announced. 

NB, the z...@lists.illumos.org is but one of the many discuss groups
where ZFS users can get questions answered. There is also active
Mac OSX, ZFS on Linux, and OTN lists. IMHO, zfs-discuss@opensolaris 
is shrinking, not growing.
  -- richard



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-03 Thread Richard Elling
On Aug 2, 2012, at 5:40 PM, Nigel W wrote:

 On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
 
 
 Yes. +1
 
 The L2ARC as is it currently implemented is not terribly useful for
 storing the DDT in anyway because each DDT entry is 376 bytes but the
 L2ARC reference is 176 bytes, so best case you get just over double
 the DDT entries in the L2ARC as what you would get into the ARC but
 then you have also have no ARC left for anything else :(.
 
 
 You are making the assumption that each DDT table entry consumes one
 metadata update. This is not the case. The DDT is implemented as an AVL
 tree. As per other metadata in ZFS, the data is compressed. So you cannot
 make a direct correlation between the DDT entry size and the effect on the
 stored metadata on disk sectors.
 -- richard
 
 It's compressed even when in the ARC?


That is a slightly odd question. The ARC contains ZFS blocks. DDT metadata is 
manipulated in memory as an AVL tree, so what you can see in the ARC is the
metadata blocks that were read and uncompressed from the pool or packaged
in blocks and written to the pool. Perhaps it is easier to think of them as 
metadata
in transition? :-)
 -- richard



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread Richard Elling
On Aug 1, 2012, at 2:41 PM, Peter Jeremy wrote:

 On 2012-Aug-01 21:00:46 +0530, Nigel W nige...@nosun.ca wrote:
 I think a fantastic idea for dealing with the DDT (and all other
 metadata for that matter) would be an option to put (a copy of)
 metadata exclusively on a SSD.
 
 This is on my wishlist as well.  I believe ZEVO supports it so possibly
 it'll be available in ZFS in the near future.

ZEVO does not. The only ZFS vendor I'm aware of with a separate top-level
vdev for metadata is Tegile, and it is available today. 
 -- richard



Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-02 Thread Richard Elling
On Aug 1, 2012, at 8:30 AM, Nigel W wrote:
 On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 On 08/01/2012 04:14 PM, Jim Klimov wrote:
 chances are that
 some blocks of userdata might be more popular than a DDT block and
 would push it out of L2ARC as well...
 
 Which is why I plan on investigating implementing some tunable policy
 module that would allow the administrator to get around this problem.
 E.g. administrator dedicates 50G of ARC space to metadata (which
 includes the DDT) or only the DDT specifically. My idea is still a bit
 fuzzy, but it revolves primarily around allocating and policing min and
 max quotas for a given ARC entry type. I'll start a separate discussion
 thread for this later on once I have everything organized in my mind
 about where I plan on taking this.
 
 
 Yes. +1
 
 The L2ARC as is it currently implemented is not terribly useful for
 storing the DDT in anyway because each DDT entry is 376 bytes but the
 L2ARC reference is 176 bytes, so best case you get just over double
 the DDT entries in the L2ARC as what you would get into the ARC but
 then you have also have no ARC left for anything else :(.

You are making the assumption that each DDT table entry consumes one
metadata update. This is not the case. The DDT is implemented as an AVL
tree. As per other metadata in ZFS, the data is compressed. So you cannot
make a direct correlation between the DDT entry size and the effect on the
stored metadata on disk sectors.
 -- richard
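
For sizing exercises, the DDT entry counts and on-disk/in-core sizes can be pulled from zdb without touching the ARC (pool name hypothetical):

  # zdb -DD tank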



Re: [zfs-discuss] unable to import the zpool

2012-08-02 Thread Richard Elling

On Aug 1, 2012, at 12:21 AM, Suresh Kumar wrote:

 Dear ZFS-Users,
 
 I am using Solarisx86 10u10, All the devices which are belongs to my zpool 
 are in available state .
 But I am unable to import the zpool.
 
 #zpool import tXstpool
 cannot import 'tXstpool': one or more devices is currently unavailable
 ==
 bash-3.2# zpool import
   pool: tXstpool
 id: 13623426894836622462
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
 devices and try again.
see: http://www.sun.com/msg/ZFS-8000-6X
 config:
 
  tXstpool                      UNAVAIL  missing device
    mirror-0                    DEGRADED
      c2t210100E08BB2FC85d0s0   FAULTED  corrupted data
      c2t21E08B92FC85d2         ONLINE
 
 Additional devices are known to be part of this pool, though their
 exact configuration cannot be determined.
 

This message is your clue. The pool is missing a device. In most of the cases
where I've seen this, it occurs on older ZFS implementations and the missing
device is an auxiliary device: cache or spare.
 -- richard



Re: [zfs-discuss] encfs on top of zfs

2012-08-02 Thread Richard Elling

On Jul 31, 2012, at 8:05 PM, opensolarisisdeadlongliveopensolaris wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Richard Elling
 
 I believe what you meant to say was dedup with HDDs sux. If you had
 used fast SSDs instead of HDDs, you will find dedup to be quite fast.
  -- richard
 
 Yes, but this is a linear scale.  

No, it is definitely NOT a linear scale. Study Amdahl's law a little more 
carefully.
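
A hypothetical worked example of the non-linearity: Amdahl's law gives speedup = 1 / ((1 - p) + p/s). If p = 70% of the workload is random I/O that the SSD accelerates by s = 100x, the overall speedup is 1 / (0.30 + 0.007), roughly 3.3x -- nowhere near 100x, and not a simple division of the two factors.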

 Suppose an SSD without dedup is 100x faster than a HDD without dedup.  And 
 suppose dedup slows down a system by a factor of 10x.  Now your SSD with 
 dedup is only 10x faster than the HDD without dedup.  So quite fast is a 
 relative term.

Of course it is.

  The SSD with dedup is still faster than the HDD without dedup, but it's also 
 slower than the SSD without dedup.

duh. With dedup you are trading IOPS for space. In general, HDDs have lots of 
space and
terrible IOPS. SSDs have less space, but more IOPS. Obviously, as you point 
out, the best
solution is lots of space and lots of IOPS.

 The extent of fibbing I'm doing is thusly:  In reality, an SSD is about 
 equally fast with HDD for sequential operations, and about 100x faster for 
 random IO.  It just so happens that the dedup performance hit is almost 
 purely random IO, so it's right in the sweet spot of what SSD's handle well.  

In the vast majority of modern systems, there are no sequential I/O workloads. 
That is a myth 
propagated by people who still think HDDs can be fast.

 You can't use an overly simplified linear model like I described above - In 
 reality, there's a grain of truth in what Richard said, and also a grain of 
 truth in what I said.  The real truth is somewhere in between what he said 
 and what I said.

But closer to my truth :-)

 No, the SSD will not perform as well with dedup as it does without dedup.  
 But the suppose dedup slows down by 10x that I described above is not 
 accurate.  Depending on what you're doing, dedup might slow down an HDD by 
 20x, and it might only slow down SSD by 4x doing the same work load.  Highly 
 variable, and highly dependent on the specifics of your workload.

You are making the assumption that the system is not bandwidth limited. This is 
a
good assumption for the HDD case, because the media bandwidth is much less 
than the interconnect bandwidth. For SSDs, this assumption is not necessarily 
true.
There are SSDs that are bandwidth constrained on the interconnect, and in those
cases, your model fails.
 -- richard



Re: [zfs-discuss] ZFS Pool Unavailable

2012-08-01 Thread Richard Elling
On Aug 1, 2012, at 8:04 AM, Jesse Jamez wrote:

 Hello,
 
 I recently rebooted my workstation and the disk names changed causing my ZFS 
 pool to be unavailable.

What OS and release?

 
 I did not make any hardware changes.  My first question is the obvious one: did
 I lose my data?  Can I recover it?

Yes, just import the pool.

 
 What would cause the names to change? Delay in the order that the HBA brought 
 them up?

It depends on your OS and OBP (or BIOS).

 
 How can I correct this problem going forward?

The currently imported pool configurations are recorded in the 
/etc/zfs/zpool.cache
file for Solaris-like OSes. At boot time, the system will try to import the 
pools in the
cache. If the cache contents no longer match reality for non-root pools, then 
the
safest action is to not automatically import the pool. An error message is 
displayed
and should point to a website that tells you how to correct this (NB, depending 
on the
OS, that URL may or may not exist at Oracle (nee Sun))
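
One common fix, assuming a non-root pool named tank: if the pool still shows up as imported but unavailable, export it first, then re-import with a device scan so the cache is rewritten with the current device names:

  # zpool export tank
  # zpool import -d /dev/dsk tank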
 -- richard



Re: [zfs-discuss] encfs on top of zfs

2012-07-31 Thread Richard Elling

On Jul 31, 2012, at 10:07 AM, Nigel W wrote:

 On Tue, Jul 31, 2012 at 9:36 AM, Ray Arachelian r...@arachelian.com wrote:
 On 07/31/2012 09:46 AM, opensolarisisdeadlongliveopensolaris wrote:
 Dedup: First of all, I don't recommend using dedup under any
 circumstance. Not that it's unstable or anything, just that the
 performance is so horrible, it's never worth while. But particularly
 with encrypted data, you're guaranteed to have no duplicate data
 anyway, so it would be a pure waste. Don't do it.
 ___ zfs-discuss mailing
 list zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 One thing you can do is enable dedup when you copy all your data from
 one zpool to another, then, when you're done, disable dedup.  It will no
 longer waste a ton of memory, and your new volume will have a high dedup
 ratio. (Obviously anything you add after you turn dedup off won't be
 deduped.)  You can keep the old pool as a backup, or wipe it or whatever
 and later on do the same operation in the other direction.
 
 Once something is written deduped you will always use the memory when
 you want to read any files that were written when dedup was enabled,
 so you do not save any memory unless you do not normally access most
 of your data.
 
 Also don't let the system crash :D or try to delete too much from the
 deduped dataset :D (including snapshots or the dataset itself) because
 then you have to reload all (most) of the DDT in order to delete the
 files.  This gets a lot of people in trouble (including me at $work
 :|) because you need to have the ram available at all times to load
 the most (75% to grab a number out of the air) in case the server
 crashes. Otherwise you are stuck with a machine trying to verify its
 filesystem for hours. I have one test system that has 4 GB of RAM and
 2 TB of deduped data, when it crashes (panic, powerfailure, etc) it
 would take 8-12 hours to boot up again.  It now has 1TB of data and
 will boot in about 5 minutes or so.

I believe what you meant to say was dedup with HDDs sux. If you had
used fast SSDs instead of HDDs, you will find dedup to be quite fast.
 -- richard



Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-30 Thread Richard Elling
On Jul 30, 2012, at 10:20 AM, Roy Sigurd Karlsbakk wrote:
 - Opprinnelig melding -
 On Mon, Jul 30, 2012 at 9:38 AM, Roy Sigurd Karlsbakk
 r...@karlsbakk.net wrote:
 Also keep in mind that if you have an SLOG (ZIL on a separate
 device), and then lose this SLOG (disk crash etc), you will
 probably
 lose the pool. So if you want/need SLOG, you probably want two of
 them in a mirror…
 
 That's only true on older versions of ZFS. ZFSv19 (or 20?) includes
 the ability to import a pool with a failed/missing log device. You
 lose any data that is in the log and not in the pool, but the pool
 is
 importable.
 
 Are you sure? I booted this v28 pool a couple of months back, and
 found it didn't recognize its pool, apparently because of a missing
 SLOG. It turned out the cache shelf was disconnected, after
 re-connecting it, things worked as planned. I didn't try to force a
 new import, though, but it didn't boot up normally, and told me it
 couldn't import its pool due to lack of SLOG devices.
 
 Positive. :) I tested it with ZFSv28 on FreeBSD 9-STABLE a month or
 two ago. See the updated man page for zpool, especially the bit about
 import -m. :)
 
 On 151a2, man page just says 'use this or that mountpoint' with import -m, 
 but the fact was zpool refused to import the pool at boot when 2 SLOG devices 
 (mirrored) and 10 L2ARC devices were offline. Should OI/Illumos be able to 
 boot cleanly without manual action with the SLOG devices gone?

No. Missing slogs is a potential data-loss condition. Importing the pool without
slogs requires acceptance of the data-loss -- human interaction.
 -- richard
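
For reference, a minimal sketch of that manual path (pool and device names are
placeholders):

    # zpool import -m tank        (accept the potential loss of uncommitted
                                   sync writes and import without the slog)
    # zpool status tank           (the missing log device shows as UNAVAIL)
    # zpool remove tank c9t0d0    (optionally remove the dead log device;
                                   log removal is supported since pool v19)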

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-30 Thread Richard Elling
On Jul 30, 2012, at 12:25 PM, Tim Cook wrote:
 On Mon, Jul 30, 2012 at 12:44 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 On Jul 30, 2012, at 10:20 AM, Roy Sigurd Karlsbakk wrote:
 - Opprinnelig melding -
 On Mon, Jul 30, 2012 at 9:38 AM, Roy Sigurd Karlsbakk
 r...@karlsbakk.net wrote:
 Also keep in mind that if you have an SLOG (ZIL on a separate
 device), and then lose this SLOG (disk crash etc), you will
 probably
 lose the pool. So if you want/need SLOG, you probably want two of
 them in a mirror…
 
 That's only true on older versions of ZFS. ZFSv19 (or 20?) includes
 the ability to import a pool with a failed/missing log device. You
 lose any data that is in the log and not in the pool, but the pool
 is
 importable.
 
 Are you sure? I booted this v28 pool a couple of months back, and
 found it didn't recognize its pool, apparently because of a missing
 SLOG. It turned out the cache shelf was disconnected, after
 re-connecting it, things worked as planned. I didn't try to force a
 new import, though, but it didn't boot up normally, and told me it
 couldn't import its pool due to lack of SLOG devices.
 
 Positive. :) I tested it with ZFSv28 on FreeBSD 9-STABLE a month or
 two ago. See the updated man page for zpool, especially the bit about
 import -m. :)
 
 On 151a2, man page just says 'use this or that mountpoint' with import -m, 
 but the fact was zpool refused to import the pool at boot when 2 SLOG 
 devices (mirrored) and 10 L2ARC devices were offline. Should OI/Illumos be 
 able to boot cleanly without manual action with the SLOG devices gone?
 
 No. Missing slogs is a potential data-loss condition. Importing the pool 
 without
 slogs requires acceptance of the data-loss -- human interaction.
  -- richard
 
 --
 ZFS Performance and Training
 richard.ell...@richardelling.com
 +1-760-896-4422
 
 
 
 I would think a flag to allow you to automatically continue with a disclaimer 
 might be warranted (default behavior obviously requiring human input). 

Disagree, the appropriate action is to boot as far as possible.
The pool will not be imported and will have the normal fault management
alerts generated.

For interactive use, the import will fail, and you can add the -m option.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-29 Thread Richard Elling
On Jul 29, 2012, at 7:07 AM, Jim Klimov wrote:

 Hello, list
 
  For several times now I've seen statements on this list implying
 that a dedicated ZIL/SLOG device catching sync writes for the log,
 also allows for more streamlined writes to the pool during normal
 healthy TXG syncs, than is the case with the default ZIL located
 within the pool.

I'm not sure where you are heading here. Space for the data in the
pool is allocated based on the policies of the pool.

  Is this understanding correct? Does it apply to any generic writes,
 or only to sync-heavy scenarios like databases or NFS servers?


Async writes don't use the ZIL.
 -- richard
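
If you want to confirm that on a live system, one rough sketch is to count which
processes trigger ZIL commits; zil_commit is an internal ZFS function, so the
symbol name may differ between releases:

    # dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); } tick-10s { exit(0); }'

A purely async workload should show little or no zil_commit activity.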

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-29 Thread Richard Elling
On Jul 29, 2012, at 1:53 PM, Jim Klimov wrote:

 2012-07-30 0:40, opensolarisisdeadlongliveopensolaris wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
For several times now I've seen statements on this list implying
 that a dedicated ZIL/SLOG device catching sync writes for the log,
 also allows for more streamlined writes to the pool during normal
 healthy TXG syncs, than is the case with the default ZIL located
 within the pool.
 
 It might just be more clear, if it's stated differently:
 
 At any given time, your pool is in one of four states:  idle, reading, 
 writing, or idle with writes queued but not currently being written.  Now a 
 sync write operation takes place.  If you have a dedicated log, it goes 
 directly to the log, and it doesn't interfere with any of the other 
 operations that might be occurring right now.  You don't have to interrupt 
 your current activity, simply, your sync write goes to a dedicated device 
 that's guaranteed to be idle in relation to all that other stuff.  Then the 
 sync write becomes async, and gets coalesced into the pending TXG.
 
 If you don't have a dedicated log, then the sync write jumps the write 
 queue, and becomes next in line.  It waits for the present read or write 
 operation to complete, and then the sync write hits the disk, and flushes 
 the disk buffer.  This means the sync write suffered a penalty waiting for 
 the main pool disks to be interruptible.  Without slog, you're causing delay 
 to your sync writes, and you're causing delay before the next read or write 
 operation can begin...  But that's it.  Without slog, your operations are 
 serial, whereas, with slog your sync write can occur in parallel to your 
 other operations.
 
 There's no extra fragmentation, with or without slog.  Because in either 
 case, the sync write hits some dedicated and recyclable disk blocks, and 
 then it becomes async and coalesced with all the other async writes.  The 
 layout and/or fragmentation characteristics of the permanent TXG to be 
 written to the pool is exactly the same either way.
 
 Thanks... but doesn't your description imply that the sync writes
 would always be written twice? It should be with dedicated SLOG, but
 even with one, I think, small writes hit the SLOG and large ones
 go straight to the pool devices (and smaller blocks catch up from
 the TXG queue upon TXG flush). However, without a dedicated SLOG,
 I thought that the writes into the ZIL happen once on the main
 pool devices, and then are referenced from the live block pointer
 tree without being rewritten elsewhere (and for the next TXG some
 other location may be used for the ZIL). Maybe I am wrong, because
 it would also make sense for small writes to hit the disk twice
 indeed, and the same pool location(s) being reused for the ZIL.

You are both right and wrong, at the same time. It depends on the data.
Without a slog, writes that are larger than zfs_immediate_write_sz are
written to the permanent place in the pool. Please review (again) my 
slides on the subject.
http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
slide 78.

For those who prefer to be lectured, another opportunity will arise in
December 2012 in San Diego at the LISA'12 conference. I am revamping
much of the material from 2011 to catch up with all of the cool new things
that arrived and are due this year.
 -- richard
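
For reference, a sketch of checking and tuning that threshold, assuming the
tunable is still named zfs_immediate_write_sz in your release (default 32768
bytes):

    # echo 'zfs_immediate_write_sz/E' | mdb -k      (print the current value)

and in /etc/system, taking effect at the next boot:

    set zfs:zfs_immediate_write_sz = 0x8000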

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IO load questions

2012-07-25 Thread Richard Elling
On Jul 25, 2012, at 7:34 AM, Matt Breitbach wrote:

 NFS – iSCSI and FC/FCoE to come once I get it into the proper lab.

ok, so NFS for these tests.

I'm not convinced a single ESXi box can drive the load to saturate 10GbE.

Also, depending on how you are configuring the system, the I/O that you 
think is 4KB might look very different coming out of ESXi. Use nfssvrtop
or one of the many dtrace one-liners for observing NFS traffic to see what is
really on the wire. And I'm very interested to know if you see 16KB reads
during the write-only workload.
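
A minimal sketch of such a one-liner, assuming the nfsv3 DTrace provider is
available on your storage head (the count field is part of the standard
READ3args/WRITE3args probe arguments):

    # dtrace -qn '
      nfsv3:::op-read-start  { @["NFSv3 read bytes"]  = quantize(args[2]->count); }
      nfsv3:::op-write-start { @["NFSv3 write bytes"] = quantize(args[2]->count); }
      tick-30s { exit(0); }'

The histograms show the sizes that actually hit the wire, regardless of what the
guest thinks it is issuing.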

more below...


 From: Richard Elling [mailto:richard.ell...@gmail.com] 
 Sent: Tuesday, July 24, 2012 11:36 PM
 To: matth...@flash.shanje.com
 Cc: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] IO load questions
  
 Important question, what is the interconnect? iSCSI? FC? NFS?
  -- richard
  
 On Jul 24, 2012, at 9:44 AM, matth...@flash.shanje.com wrote:
 
 
 Working on a POC for high IO workloads, and I’m running in to a bottleneck 
 that I’m not sure I can solve.  Testbed looks like this :
 
 SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPU’s, 72GB RAM, and ESXi
 VM – 4GB RAM, 1vCPU
 Connectivity dual 10Gbit Ethernet to Cisco Nexus 5010
 
 Target Nexenta system :
 
 Intel barebones, Dual Xeon 5620 CPU’s, 192GB RAM, Nexenta 3.1.3 Enterprise
 Intel x520 dual port 10Gbit Ethernet – LACP Active VPC to Nexus 5010 switches.
 2x LSI 9201-16E HBA’s, 1x LSI 9200-8e HBA
 5 DAE’s (3 in use for this test)
 1 DAE – connected (multipathed) to LSI 9200-8e.  Loaded w/ 6x Stec ZeusRAM 
 SSD’s – striped for ZIL, and 6x OCZ Talos C 230GB drives for L2ARC.
 2 DAE’s connected (multipathed) to one LSI 9201-16E – 24x 600GB 15k Seagate 
 Cheetah drives
 Obviously data integrity is not guaranteed
 
 Testing using IOMeter from windows guest, 10GB test file, queue depth of 64
 I have a share set up with 4k recordsizes, compression disabled, access time 
 disabled, and am seeing performance as follows :
 
 ~50,000 IOPS 4k random read.  200MB/sec, 30% CPU utilization on Nexenta, ~90% 
 utilization on guest OS.  I’m guessing guest OS is bottlenecking.  Going to 
 try physical hardware next week
 ~25,000 IOPS 4k random write.  100MB/sec, ~70% CPU utilization on Nexenta, 
 ~45% CPU utilization on guest OS.  Feels like Nexenta CPU is bottleneck. Load 
 average of 2.5

For cases where you are not bandwidth limited, larger recordsizes can be more 
efficient. There
is no good rule-of-thumb for this, and larger recordsizes will, at some point, 
hit the bandwidth
bottlenecks. I've had good luck with 8KB and 32KB recordsize for ESXi+Windows 
over NFS.
I've never bothered to test 16KB, due to lack of time.
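
Changing it is one command, though it only affects blocks written after the
change; the dataset name below is a placeholder:

    # zfs set recordsize=32k tank/esxi_nfs
    # zfs get recordsize tank/esxi_nfs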

 A quick test with 128k recordsizes and 128k IO looked to be 400MB/sec 
 performance, can’t remember CPU utilization on either side. Will retest and 
 report those numbers.

It would not surprise me to see a CPU bottleneck on the ESXi side at these 
levels.
 -- richard

 
 It feels like something is adding more overhead here than I would expect on 
 the 4k recordsizes/IO workloads.  Any thoughts where I should start on this?  
 I’d really like to see closer to 10Gbit performance here, but it seems like 
 the hardware isn’t able to cope with it?
  
 Theoretical peak performance for a single 10GbE wire is near 300k IOPS @ 4KB, 
 unidirectional.
 This workload is extraordinarily difficult to achieve with a single client 
 using any of the popular
 storage protocols.
  -- richard
  
 --
 ZFS Performance and Training
 richard.ell...@richardelling.com
 +1-760-896-4422
  
  
  
  
 
 
  

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IO load questions

2012-07-24 Thread Richard Elling
Important question, what is the interconnect? iSCSI? FC? NFS?
 -- richard

On Jul 24, 2012, at 9:44 AM, matth...@flash.shanje.com wrote:

 Working on a POC for high IO workloads, and I’m running in to a bottleneck 
 that I’m not sure I can solve.  Testbed looks like this :
 
 SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPU’s, 72GB RAM, and ESXi
 VM – 4GB RAM, 1vCPU
 Connectivity dual 10Gbit Ethernet to Cisco Nexus 5010
 
 Target Nexenta system :
 
 Intel barebones, Dual Xeon 5620 CPU’s, 192GB RAM, Nexenta 3.1.3 Enterprise
 Intel x520 dual port 10Gbit Ethernet – LACP Active VPC to Nexus 5010 switches.
 2x LSI 9201-16E HBA’s, 1x LSI 9200-8e HBA
 5 DAE’s (3 in use for this test)
 1 DAE – connected (multipathed) to LSI 9200-8e.  Loaded w/ 6x Stec ZeusRAM 
 SSD’s – striped for ZIL, and 6x OCZ Talos C 230GB drives for L2ARC.
 2 DAE’s connected (multipathed) to one LSI 9201-16E – 24x 600GB 15k Seagate 
 Cheetah drives
 Obviously data integrity is not guaranteed
 
 Testing using IOMeter from windows guest, 10GB test file, queue depth of 64
 I have a share set up with 4k recordsizes, compression disabled, access time 
 disabled, and am seeing performance as follows :
 
 ~50,000 IOPS 4k random read.  200MB/sec, 30% CPU utilization on Nexenta, ~90% 
 utilization on guest OS.  I’m guessing guest OS is bottlenecking.  Going to 
 try physical hardware next week
 ~25,000 IOPS 4k random write.  100MB/sec, ~70% CPU utilization on Nexenta, 
 ~45% CPU utilization on guest OS.  Feels like Nexenta CPU is bottleneck. Load 
 average of 2.5
 
 A quick test with 128k recordsizes and 128k IO looked to be 400MB/sec 
 performance, can’t remember CPU utilization on either side. Will retest and 
 report those numbers.
 
 It feels like something is adding more overhead here than I would expect on 
 the 4k recordsizes/IO workloads.  Any thoughts where I should start on this?  
 I’d really like to see closer to 10Gbit performance here, but it seems like 
 the hardware isn’t able to cope with it?

Theoretical peak performance for a single 10GbE wire is near 300k IOPS @ 4KB, 
unidirectional.
This workload is extraordinarily difficult to achieve with a single client 
using any of the popular
storage protocols.
 -- richard
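
For reference, the arithmetic behind that ceiling: 10 Gbit/s is roughly 1.25 GB/s
of raw bandwidth, and 1.25 GB/s divided by 4 KiB per operation is about 305,000
IOPS, before any Ethernet/IP/NFS framing overhead is subtracted.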

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow speed problem with a new SAS shelf

2012-07-23 Thread Richard Elling
On Jul 22, 2012, at 10:18 PM, Yuri Vorobyev wrote:

 Hello.
 
 I faced a strange performance problem with a new disk shelf.
 We have been using a ZFS system with SATA disks for a while.

What OS and release?
 -- richard

 It is Supermicro SC846-E16 chassis, Supermicro X8DTH-6F motherboard with 96Gb 
 RAM and 24 HITACHI HDS723020BLA642 SATA disks attached to onboard LSI 2008 
 controller.
 
 Pretty much satisfied with it, we bought an additional shelf with SAS disks for 
 VM hosting. The new shelf is a Supermicro SC846-E26 chassis. The disk model is 
 HITACHI HUS156060VLS600 (15K 600GB SAS2).
 An additional LSI 9205-8e controller was installed in the server and connected 
 to the JBOD.
 I connected the JBOD with 2 channels and set up multipath first, but when I 
 noticed the performance problem I disabled multipath and disconnected one cable 
 (to be sure multipath is not the cause of the problem).
 
 Problem description follow:
 
 Creating test pool with 5 pair of mirrors (new shelf, SAS disks)
 
 # zpool create -o version=28 -O primarycache=none test mirror 
 c9t5000CCA02A138899d0 c9t5000CCA02A102181d0 mirror c9t5000CCA02A13500Dd0 
 c9t5000CCA02A13316Dd0 mirror c9t5000CCA02A005699d0 c9t5000CCA02A004271d0 
 mirror c9t5000CCA02A004229d0 c9t5000CCA02A1342CDd0 mirror 
 c9t5000CCA02A1251E5d0 c9t5000CCA02A1151DDd0
 
 (primarycache=none) to disable ARC influence
 
 
 Testing sequential write
 # dd if=/dev/zero of=/test/zero bs=1M count=2048
 2048+0 records in
 2048+0 records out
 2147483648 bytes (2.1 GB) copied, 1.04272 s, 2.1 GB/s
 
 iostat when writing looks like:
     r/s     w/s   kr/s     kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
     0.0  1334.6    0.0 165782.9   0.0   8.4     0.0     6.3   1  86  c9t5000CCA02A1151DDd0
     0.0  1345.5    0.0 169575.3   0.0   8.7     0.0     6.5   1  88  c9t5000CCA02A1342CDd0
     2.0  1359.5    1.0 168969.8   0.0   8.7     0.0     6.4   1  90  c9t5000CCA02A13500Dd0
     0.0  1358.5    0.0 168714.0   0.0   8.7     0.0     6.4   1  90  c9t5000CCA02A13316Dd0
     0.0  1345.5    0.0     19.3   0.0   9.0     0.0     6.7   1  92  c9t5000CCA02A102181d0
     1.0  1317.5    1.0 164456.9   0.0   8.5     0.0     6.5   1  88  c9t5000CCA02A004271d0
     4.0  1342.5    2.0 166282.2   0.0   8.5     0.0     6.3   1  88  c9t5000CCA02A1251E5d0
     0.0  1377.5    0.0 170515.5   0.0   8.7     0.0     6.3   1  90  c9t5000CCA02A138899d0
 
 Now read
 # dd if=/test/zero of=/dev/null  bs=1M
 2048+0 records in
 2048+0 records out
 2147483648 bytes (2.1 GB) copied, 13.5681 s, 158 MB/s
 
 iostat when reading:
     r/s    w/s     kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   106.0    0.0  11417.4    0.0   0.0   0.2     0.0     2.4   0  14  c9t5000CCA02A004271d0
    80.0    0.0  10239.9    0.0   0.0   0.2     0.0     2.4   0  10  c9t5000CCA02A1251E5d0
   110.0    0.0  12182.4    0.0   0.0   0.1     0.0     1.3   0   9  c9t5000CCA02A138899d0
   102.0    0.0  11664.4    0.0   0.0   0.2     0.0     1.8   0  15  c9t5000CCA02A005699d0
    99.0    0.0  10900.9    0.0   0.0   0.3     0.0     3.0   0  16  c9t5000CCA02A004229d0
   107.0    0.0  11545.4    0.0   0.0   0.2     0.0     1.9   0  13  c9t5000CCA02A1151DDd0
    81.0    0.0  10367.9    0.0   0.0   0.2     0.0     2.2   0  11  c9t5000CCA02A1342CDd0
 
 Unexpected low speed! Note the busy column: when writing it is about 90%, when 
 reading it is about 15%.
 
 Individual disks' raw read speed (don't be confused by the name change; I 
 connected the JBOD to another HBA channel):
 
 # dd if=/dev/dsk/c8t5000CCA02A13889Ad0 of=/dev/null bs=1M count=2000
 2000+0 records in
 2000+0 records out
 2097152000 bytes (2.1 GB) copied, 10.9685 s, 191 MB/s
 # dd if=/dev/dsk/c8t5000CCA02A1342CEd0 of=/dev/null bs=1M count=2000
 2000+0 records in
 2000+0 records out
 2097152000 bytes (2.1 GB) copied, 10.8024 s, 194 MB/s
 
 The 10-disk mirror zpool reads slower than a single disk.
 
 There is no tuning in /etc/system
 
 I tried the test with a FreeBSD 8.3 live CD. Reads were the same (about 150 MB/s). 
 I also tried SmartOS, but it can't see disks behind the LSI 9205-8e controller.
 
 For comparison, this is the speed from the SATA pool (it consists of 4 6-disk raidz2 vdevs):
 #dd if=CentOS-6.2-x86_64-bin-DVD1.iso of=/dev/null bs=1M
 4218+1 records in
 4218+1 records out
 4423129088 bytes (4.4 GB) copied, 4.76552 s, 928 MB/s
 
      r/s    w/s      kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  13614.4    0.0  800338.5    0.0   0.1  36.0     0.0     2.6   0  914  c6
    459.9    0.0   25761.4    0.0   0.0   0.8     0.0     1.8   0   22  c6t5000CCA369D16860d0
     84.0    0.0    2785.2    0.0   0.0   0.2     0.0     3.0   0   13  c6t5000CCA369D1B1E0d0
    836.9    0.0   50089.5    0.0   0.0   2.6     0.0     3.1   0   60  c6t5000CCA369D1B302d0
    411.0    0.0   24492.6    0.0   0.0   0.8     0.0     2.1   0   25  c6t5000CCA369D16982d0
    821.9    0.0   49385.1    0.0   0.0   3.0     0.0     3.7   0   67  c6t5000CCA369CFBDA3d0
    231.0    0.0   12292.5    0.0   0.0   0.5     0.0     2.3   0   18  c6t5000CCA369D17E73d0
    803.9    0.0   50091.5    0.0   0.0   2.9     0.0     3.6   1   69  c6t5000CCA369D0EA93d0
 
 PS. Before testing I flashed the latest firmware and BIOS 

Re: [zfs-discuss] zfs sata mirror slower than single disk

2012-07-16 Thread Richard Elling

On Jul 16, 2012, at 2:43 AM, Michael Hase wrote:

 Hello list,
 
 did some bonnie++ benchmarks for different zpool configurations
 consisting of one or two 1tb sata disks (hitachi hds721010cla332, 512
 bytes/sector, 7.2k), and got some strange results; please see
 attachments for exact numbers and pool config:
 
            seq write   factor   seq read   factor
             MB/sec               MB/sec
 single        123        1         135        1
 raid0         114        1         249        2
 mirror         57        0.5       129        1
 
 Each of the disks is capable of about 135 MB/sec sequential reads and
 about 120 MB/sec sequential writes, iostat -En shows no defects. Disks
 are 100% busy in all tests, and show normal service times.

For 7,200 rpm disks, average service times should be on the order of 10 ms for
writes and 13 ms for reads. If you see averages > 20 ms, then you are likely 
running into scheduling issues.
 -- richard

 This is on
 opensolaris 130b, rebooting with openindiana 151a live cd gives the
 same results, dd tests give the same results, too. Storage controller
 is an lsi 1068 using mpt driver. The pools are newly created and
 empty. atime on/off doesn't make a difference.
 
 Is there an explanation why
 
 1) in the raid0 case the write speed is more or less the same as a
 single disk.
 
 2) in the mirror case the write speed is cut by half, and the read
 speed is the same as a single disk. I'd expect about twice the
 performance for both reading and writing, maybe a bit less, but
 definitely more than measured.
 
 For comparison I did the same tests with 2 old 2.5 36gb sas 10k disks
 maxing out at about 50-60 MB/sec on the outer tracks.
 
            seq write   factor   seq read   factor
             MB/sec               MB/sec
 single         38        1          50        1
 raid0          89        2         111        2
 mirror         36        1          92        2
 
 Here we get the expected behaviour: raid0 with about double the
 performance for reading and writing, mirror about the same performance
 for writing, and double the speed for reading, compared to a single
 disk. An old scsi system with 4x2 mirror pairs also shows these
 scaling characteristics, about 450-500 MB/sec seq read and 250 MB/sec
 write, each disk capable of 80 MB/sec. I don't care about absolute
 numbers, just don't get why the sata system is so much slower than
 expected, especially for a simple mirror. Any ideas?
 
 Thanks,
 Michael
 
 -- 
 Michael Hase
 http://edition-software.de
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
Thanks Sašo!
Comments below...

On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote:

 Hi guys,
 
 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256.

No need to supplant, there are 8 bits for enumerating hash algorithms, so 
adding another is simply a matter of coding. With the new feature flags, it is
almost trivial to add new algorithms without causing major compatibility 
headaches. Darren points out that Oracle is considering doing the same,
though I do not expect Oracle to pick up the feature flags.

 On modern
 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
 slower than many of the SHA-3 candidates, so I went out and did some
 testing (details attached) on a possible new hash algorithm that might
 improve on this situation.
 
 However, before I start out on a pointless endeavor, I wanted to probe
 the field of ZFS users, especially those using dedup, on whether their
 workloads would benefit from a faster hash algorithm (and hence, lower
 CPU utilization). Developments of late have suggested to me three
 possible candidates:
 
 * SHA-512: simplest to implement (since the code is already in the
   kernel) and provides a modest performance boost of around 60%.
 
 * Skein-512: overall fastest of the SHA-3 finalists and much faster
   than SHA-512 (around 120-150% faster than the current sha256).
 
 * Edon-R-512: probably the fastest general purpose hash algorithm I've
   ever seen (upward of 300% speedup over sha256), but might have
   potential security problems (though I don't think this is of any
   relevance to ZFS, as it doesn't use the hash for any kind of security
   purposes, but only for data integrity & dedup).
 
 My testing procedure: nothing sophisticated, I took the implementation
 of sha256 from the Illumos kernel and simply ran it on a dedicated
 psrset (where possible with a whole CPU dedicated, even if only to a
 single thread) - I tested both the generic C implementation and the
 Intel assembly implementation. The Skein and Edon-R implementations are
 in C optimized for 64-bit architectures from the respective authors (the
 most up to date versions I could find). All code has been compiled using
 GCC 3.4.3 from the repos (the same that can be used for building
 Illumos). Sadly, I don't have access to Sun Studio.

The last studio release suitable for building OpenSolaris is available in the 
repo.
See the instructions at 
http://wiki.illumos.org/display/illumos/How+To+Build+illumos

I'd be curious about whether you see much difference based on studio 12.1,
gcc 3.4.3 and gcc 4.4 (or even 4.7)
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 10:11 AM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Richard Elling wrote:
 The last studio release suitable for building OpenSolaris is available in 
 the repo.
 See the instructions at 
 http://wiki.illumos.org/display/illumos/How+To+Build+illumos
 
 Not correct as far as I can tell.  You should re-read the page you 
 referenced.  Oracle rescinded (or lost) the special Studio releases needed to 
 build the OpenSolaris kernel.  The only way I can see to obtain these 
 releases is illegally.

In the US, the term illegal is most often used for criminal law. Contracts 
between parties are covered
under civil law. It is the responsibility of the parties to agree to and 
enforce civil contracts. This includes
you, dear reader.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 10:23 AM, Sašo Kiselkov wrote:

 Hi Richard,
 
 On 07/11/2012 06:58 PM, Richard Elling wrote:
 Thanks Sašo!
 Comments below...
 
 On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote:
 
 Hi guys,
 
 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256.
 
 No need to supplant, there are 8 bits for enumerating hash algorithms, so 
 adding another is simply a matter of coding. With the new feature flags, it 
 is
 almost trivial to add new algorithms without causing major compatibility 
 headaches. Darren points out that Oracle is considering doing the same,
 though I do not expect Oracle to pick up the feature flags.
 
 I meant in the functional sense, not in the technical - of course, my
 changes would be implemented as a feature flags add-on.

Great! Let's do it! 
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Richard Elling
On Jul 11, 2012, at 1:06 PM, Bill Sommerfeld wrote:
 on a somewhat less serious note, perhaps zfs dedup should contain chinese
 lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
 which asks the sysadmin to report a detected sha-256 collision to
 eprint.iacr.org or the like...


Agree. George was in that section of the code a few months ago (zio.c) and I 
asked
him to add a kstat, at least. I'll follow up with him next week, or get it done 
some other
way.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones

2012-07-10 Thread Richard Elling
To amplify what Mike says...

On Jul 10, 2012, at 5:54 AM, Mike Gerdts wrote:
 ls(1) tells you how much data is in the file - that is, how many bytes
 of data that an application will see if it reads the whole file.
 du(1) tells you how many disk blocks are used.  If you look at the
 stat structure in stat(2), ls reports st_size, du reports st_blocks.
 
 Blocks full of zeros are special to zfs compression - it recognizes
 them and stores no data.  Thus, a file that contains only zeros will
 only require enough space to hold the file metadata.
 
 $ zfs list -o compression ./
 COMPRESS
  on
 
 $ dd if=/dev/zero of=1gig count=1024 bs=1024k
 1024+0 records in
 1024+0 records out
 
 $ ls -l 1gig
 -rw-r--r--   1 mgerdts  staff1073741824 Jul 10 07:52 1gig

ls -ls shows the length (as in -l) and size (as in -s, units=blocks)
So you can see that it takes only space for metadata.
   1 -rw-r--r--   1 root root 1073741824 Nov 26 06:52 1gig
size  length
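
du shows the same thing from the block side; the exact figure depends on the
pool layout, but for an all-zero file it will be a handful of blocks, nowhere
near 1 GB:

    $ du 1gig
    1       1gig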


 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scenario sanity check

2012-07-06 Thread Richard Elling
First things first, the panic is a bug. Please file one with your OS supplier.
More below...

On Jul 6, 2012, at 4:55 PM, Ian Collins wrote:

 On 07/ 7/12 11:29 AM, Brian Wilson wrote:
 On 07/ 6/12 04:17 PM, Ian Collins wrote:
 On 07/ 7/12 08:34 AM, Brian Wilson wrote:
 Hello,
 
 I'd like a sanity check from people more knowledgeable than myself.
 I'm managing backups on a production system.  Previously I was using
 another volume manager and filesystem on Solaris, and I've just switched
 to using ZFS.
 
 My model is -
 Production Server A
 Test Server B
 Mirrored storage arrays (HDS TruCopy if it matters)
 Backup software (TSM)
 
 Production server A sees the live volumes.
 Test Server B sees the TruCopy mirrors of the live volumes.  (it sees
 the second storage array, the production server sees the primary array)
 
 Production server A shuts down zone C, and exports the zpools for
 zone C.
 Production server A splits the mirror to secondary storage array,
 leaving the mirror writable.
 Production server A re-imports the pools for zone C, and boots zone C.
 Test Server B imports the ZFS pool using -R /backup.
 Backup software backs up the mounted mirror volumes on Test Server B.
 
 Later in the day after the backups finish, a script exports the ZFS
 pools on test server B, and re-establishes the TruCopy mirror between
 the storage arrays.
 That looks awfully complicated.   Why don't you just clone a snapshot
 and back up the clone?
 
 Taking a snapshot and cloning incurs IO.  Backing up the clone incurs a
 lot more IO reading off the disks and going over the network.  These
 aren't acceptable costs in my situation.

Yet it is acceptable to shut down the zones and export the pools? 
I'm interested to understand how a service outage is preferred over I/O?

 So splitting a mirror and reconnecting it doesn't incur I/O?

It does.

 The solution is complicated if you're starting from scratch.  I'm
 working in an environment that already had all the pieces in place
 (offsite synchronous mirroring, a test server to mount stuff up on,
 scripts that automated the storage array mirror management, etc).  It
 was setup that way specifically to accomplish short downtime outages for
 cold backups with minimal or no IO hit to production.  So while it's
 complicated, when it was put together it was also the most obvious thing
 to do to drop my backup window to almost nothing, and keep all the IO
 from the backup from impacting production.  And like I said, with a
 different volume manager, it's been rock solid for years.

... where data corruption is blissfully ignored? I'm not sure what volume 
manager you were using, but SVM has absolutely zero data integrity 
checking :-(  And no, we do not miss using SVM :-)


 So, to ask the sanity check more specifically -
 Is it reasonable to expect ZFS pools to be exported, have their luns
 change underneath, then later import the same pool on those changed
 drives again?

Yes, we do this quite frequently. And it is tested ad nauseum. Methinks it is
simply a bug, perhaps one that is already fixed.

 If you were splitting ZFS mirrors to read data from one half all would be 
 sweet (and you wouldn't have to export the pool).  I guess the question here 
 is what does TruCopy do under the hood when you re-connect the mirror?

Yes, this is one of the use cases for zpool split. However, zpool split creates 
a new
pool, which is not what Brian wants, because to reattach the disks requires a 
full resilver.
Using TrueCopy as he does, is a reasonable approach for Brian's use case.
 -- richard
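
For comparison, a minimal sketch of the zpool split workflow (pool and device
names are placeholders); the last step is the full resilver mentioned above:

    # zpool split tank tankbackup          (detach one side of each mirror into
                                            a new pool named tankbackup)
    # zpool import -R /backup tankbackup   (import the split-off copy for backup)
    ... back up, then later ...
    # zpool destroy tankbackup
    # zpool attach tank c0t0d0 c0t1d0      (reattach the freed disk; this
                                            triggers a full resilver)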

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very sick iSCSI pool

2012-06-29 Thread Richard Elling
Hi Ian,
Chapter 7 of the DTrace book has some examples of how to look at iSCSI target
and initiator behaviour.
 -- richard

On Jun 28, 2012, at 10:47 PM, Ian Collins wrote:

 I'm trying to work out the case a remedy for a very sick iSCSI pool on a 
 Solaris 11 host.
 
 The volume is exported from an Oracle storage appliance and there are no 
 errors reported there.  The host has no entries in its logs relating to the 
 network connections.
 
 Any zfs or zpool commands the change the state of the pool (such as zfs mount 
 or zpool export) hang and can't be killed.
 
 fmadm faulty reports:
 
 Jun 27 14:04:24 536fb2ad-1fca-c8b2-fc7d-f5a4a94c165d  ZFS-8000-FD    Major
 
 Host: taitaklsc01
 Platform: SUN-FIRE-X4170-M2-SERVER  Chassis_id  : 1142FMM02N
 Product_sn  : 1142FMM02N
 
 Fault class : fault.fs.zfs.vdev.io
 Affects : zfs://pool=fileserver/vdev=68c1bdefa6f97db8
  faulted but still in service
 Problem in  : zfs://pool=fileserver/vdev=68c1bdefa6f97db8
  faulted but still in service
 
 Description : The number of I/O errors associated with a ZFS device exceeded
 acceptable levels.  Refer to 
 http://sun.com/msg/ZFS-8000-FD
  for more information.
 
 The zpool status paints a very gloomy picture:
 
  pool: fileserver
 state: ONLINE
 status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 29 11:59:59 2012
858K scanned out of 15.7T at 43/s, (scan is slow, no estimated time)
567K resilvered, 0.00% done
 config:
 
NAME STATE READ WRITE CKSUM
fileserver   ONLINE   0 1.16M 0
  c0t600144F096C94AC74ECD96F20001d0  ONLINE   0 1.16M 0  
 (resilvering)
 
 errors: 1557164 data errors, use '-v' for a list
 
 Any ideas how to determine the cause of the problem and remedy it?
 
 -- 
 Ian.
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] oddity of slow zfs destroy

2012-06-25 Thread Richard Elling
On Jun 25, 2012, at 10:55 AM, Philip Brown wrote:

 I ran into something odd today:
 
 zfs destroy -r  random/filesystem
 
 is mindbogglingly slow. But seems to me, it shouldnt be.
 It's slow, because the filesystem has two snapshots on it. Presumably, it's 
 busy rolling back the snapshots.
 but I've already declared by my command line, that I DONT CARE about the 
 contents of the filesystem!
 Why doesnt zfs simply do:
 
 1. unmount filesystem, if possible (it was possible)
 (1.5 possibly note intent to delete somewhere in the pool records)
 2. zero out/free the in-kernel-memory in one go
 3. update the pool, hey I deleted the filesystem, all these blocks are now 
 clear
 
 
 Having this kind of operation take more than even 10 seconds, seems like a 
 huge bug to me. yet it can take many minutes. An order of magnitude off. yuck.

Agree. Asynchronous destroy has been integrated into illumos. Look for it soon
in the distributions derived from illumos. For more information, see Chris
Siden's and Matt Ahrens' discussions on async destroy and ZFS feature flags at
the ZFS Meetup in January 2012 here:
http://blog.delphix.com/ahl/2012/zfs10-illumos-meetup/

 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendation for home NAS external JBOD

2012-06-20 Thread Richard Elling
On Jun 20, 2012, at 4:08 AM, Jim Klimov wrote:
 
 Also by default if you don't give the whole drive to ZFS, its cache
 may be disabled upon pool import and you may have to reenable it
 manually (if you only actively use this disk for one or more ZFS
 pools - which play with caching nicely).

This is not correct. 
The behaviour is to attempt to enable the disk's write cache if ZFS has the 
whole disk. Relevant code:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#319

Please help us to stop propagating the misinformation that ZFS disables 
write caches.
 -- richard
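
If you want to check by hand, a rough sketch on a Solaris-derived system is the
cache menu in format(1M) expert mode (SATA devices behind some drivers may not
expose it):

    # format -e
      (select the disk)
      format> cache
      cache> write_cache
      write_cache> display
      Write Cache is enabled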

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommendation for home NAS external JBOD

2012-06-20 Thread Richard Elling
On Jun 20, 2012, at 5:08 PM, Jim Klimov wrote:
 2012-06-21 1:58, Richard Elling wrote:
 On Jun 20, 2012, at 4:08 AM, Jim Klimov wrote:
 
 Also by default if you don't give the whole drive to ZFS, its cache
 may be disabled upon pool import and you may have to reenable it
 
 The behaviour is to attempt to enable the disk's write cache if ZFS has the
 whole disk. Relevant code:
 http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#319
 
 Please help us to stop propagating the misinformation that ZFS disables
 write caches.
  -- richard
 
 I see, sorry. So, the possible states are:
 
 1) Before pool import, disk cache was disabled; then pool is imported:
 1a) If ZFS has whole disk (how is that defined BTW, since partitions
and slices are really used? Is the presence of a slice#7 which
is 16384 sector long the trigger?) - then cache is enabled;

by the command used:
zpool create <pool> c0t0d0   == whole disk
zpool create <pool> c0t0d0s0 == not whole disk

 1b) ZFS does not have whole disk - cache is neither enabled nor
disabled;
 
 2) Before import disk cache was enabled; after import: no change
   regardless of whole-diskness.

correct

 
 Is this correct?
 
 How does a disk become cache disabled then - only manually?
 Or due to UFS usage? Or does it inherit HW setting? Or somehow else?

For Sun, it was done by setting the disk firmware.

 I think the cache is enabled in the OS by default…

In general, illumos does not touch the cache. I don't know of a way to
set the cache policy in most BIOSes. In some cases, you can set it using
format(1m), but whether it remains set after power-off depends on the
drive manufacturer.

Bottom line: don't worry about it.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks

2012-06-16 Thread Richard Elling
On Jun 15, 2012, at 7:37 AM, Hung-Sheng Tsao Ph.D. wrote:

 by the way
 when you format start with cylinder 1 donot use 0

There is no requirement for skipping cylinder 0 for root on Solaris, and there
never has been.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-15 Thread Richard Elling
[Phil beat me to it]
Yes, the 0s are a result of integer division in DTrace/kernel.

On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:

 Indeed they are there, shown with 1 second interval.  So, it is the
 client's fault after all.  I'll have to see whether it is somehow
 possible to get the server to write cached data sooner (and hopefully
 asynchronous), and the client to issue commits less often.  Luckily I
 can live with the current behavior (and the SSDs shouldn't give out
 any time soon even being used like this), if it isn't possible to
 change it.

If this is the proposed workload, then it is possible to tune the DMU to
manage commits more efficiently. In an ideal world, it does this automatically,
but the algorithms are based on a bandwidth calculation and those are not
suitable for HDD capacity planning. The efficiency goal would be to do less
work, more often and there are two tunables that can apply:

1. the txg_timeout controls the default maximum transaction group commit
interval and is set to 5 seconds on modern ZFS implementations.

2. the zfs_write_limit is a size limit for txg commit. The idea is that a txg 
will
be committed when the size reaches this limit, rather than waiting for the
txg_timeout. For streaming writes, this can work better than tuning the 
txg_timeout.

 -- richard
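
A sketch of where those knobs live, with placeholder values to be chosen only
after measurement, assuming an illumos-era kernel where the variables are named
zfs_txg_timeout and zfs_write_limit_override:

    In /etc/system (takes effect at the next boot):
        set zfs:zfs_txg_timeout = 5

    Or at run time with mdb (a 64-bit write; 512 MB is an arbitrary example):
        # echo 'zfs_write_limit_override/Z 0t536870912' | mdb -kw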

 
 Thanks for all the help,
 Tim
 
 On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman phil.har...@gmail.com wrote:
 On 14 Jun 2012, at 23:15, Timothy Coalson tsc...@mst.edu wrote:
 
 The client is using async writes, that include commits. Sync writes do not
 need commits.
 
 Are you saying nfs commit operations sent by the client aren't always
 reported by that script?
 
 They are not reported in your case because the commit rate is less than one 
 per second.
 
 DTrace is an amazing tool, but it does dictate certain coding compromises, 
 particularly when it comes to output scaling, grouping, sorting and 
 formatting.
 
 In this script the commit rate is calculated using integer division. In your 
 case the sample interval is 5 seconds, so up to 4 commits in an interval will 
 be reported as a big fat zero.
 
 If you use a sample interval of 1 second you should see occasional commits. 
 We know they are there because we see a non-zero commit time.
 
 

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-15 Thread Richard Elling
On Jun 14, 2012, at 1:35 PM, Robert Milkowski wrote:

 The client is using async writes, that include commits. Sync writes do
 not need commits.
 
 What happens is that the ZFS transaction group commit occurs at more-
 or-less regular intervals, likely 5 seconds for more modern ZFS
 systems. When the commit occurs, any data that is in the ARC but not
 commited in a prior transaction group gets sent to the ZIL
 
 Are you sure? I don't think this is the case unless I misunderstood you or
 this is some recent change to Illumos.

Need to make sure we are clear here, there is time between the txg being
closed and the txg being on disk. During that period, a sync write of the
data in the closed txg is written to the ZIL.

 Whatever is being committed when zfs txg closes goes directly to pool and
 not to zil. Only sync writes will go to zil right a way (and not always, see
 logbias, etc.) and to arc to be committed later to a pool when txg closes.

In this specific case, there are separate log devices, so logbias doesn't apply.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Richard Elling
On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:

 On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
 client: ubuntu 11.10
 /etc/fstab entry: server:/mainpool/storage   /mnt/myelin nfs
 bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async   0
0
 
 nfsvers=3
 
 NAME  PROPERTY  VALUE SOURCE
 mainpool/storage  sync  standard  default
 
 sync=standard
 
 This is expected behaviour for this combination. NFS 3 semantics are
 for persistent writes at the server regardless - and mostly also 
 for NFS 4.

NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
async and sync activity, try nfssvrtop
https://github.com/richardelling/tools

 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Richard Elling
Hi Tim,

On Jun 14, 2012, at 12:20 PM, Timothy Coalson wrote:

 Thanks for the script.  Here is some sample output from 'sudo
 ./nfssvrtop -b 512 5' (my disks are 512B-sector emulated and the pool
 is ashift=9, some benchmarking didn't show much difference with
 ashift=12 other than giving up 8% of available space) during a copy
 operation from 37.30 with sync=standard:
 
 2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
 Ver Client         NFSOPS Reads SWrites AWrites Commits Rd_bw SWr_bw  AWr_bw Rd_t SWr_t AWr_t   Com_t Align%
 3   xxx.xxx.37.30     108     0       0     108       0     0      0  111206    0     0   396 1917419    100
 a bit later...
 3   xxx.xxx.37.30     109     0       0     108       0     0      0  111411    0     0   427       0    100
 
 sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:
 logs   -  -  -  -  -  -
  c31t3d0s0 260M  9.68G  0  1.21K  0  85.3M
  c31t4d0s0 260M  9.68G  0  1.21K  0  85.1M
 
 In case the alignment fails, the nonzero entries are under NFSOPS,
 AWrites, AWr_bw, AWr_t, Com_t and Align%.  The Com_t (average commit
 time?) column alternates between zero and a million or two (the other
 columns stay about the same, the zeros stay zero), while the Commits
 column stays zero during the copy.  The write throughput to the logs
 varies quite a bit, that sample is a very high mark, it mainly
 alternates between almost zero and 30M each, which is kind of odd
 considering the copy speed (using gigabit network, copy speed averages
 around 110MB/s).

The client is using async writes, that include commits. Sync writes do not
need commits.

What happens is that the ZFS transaction group commit occurs at more-or-less
regular intervals, likely 5 seconds for more modern ZFS systems. When the 
commit occurs, any data that is in the ARC but not commited in a prior 
transaction
group gets sent to the ZIL. This is why you might see a very different amount of
ZIL activity relative to the expected write workload.

 When I 'zfs set sync=disabled', the output of nfssrvtop stays about
 the same, except the Com_t stays 0, and the log devices also stay 0
 for throughput.  Could you enlighten me as to what Com_t measures
 when Commits stays zero?  Perhaps the nfs server caches asynchronous
 nfs writes how I expect, but flushes its cache with synchronous
 writes?
 

With sync=disabled, the ZIL is not used, thus the commit response to the client
is a lie, breaking the covenant between the server and client. In other words, 
the server is supposed to respond to the commit only when the data is written
to permanent media, but the administrator overruled this action by disabling
the ZIL. If the server was to unexpectedly restart or other conditions occur
such that the write cannot be completed, then the server and client will have
different views of the data, a form of data loss.

Different applications can react to long commit times differently. In this 
example,
we see 1.9 seconds for the commit versus about 400 microseconds for each 
async write. The cause of the latency of the commit is not apparent from any
bandwidth measurements (eg zpool iostat) and you should consider looking 
more closely at the iostat -x latency to see if the log devices are performing
well. 
 -- richard
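
A minimal check, using the log device names from the zpool iostat output above:

    # iostat -xn 5 | egrep 'device|c31t3d0|c31t4d0'

Watch the asvc_t column for those devices: a healthy SSD slog typically sits
well under a millisecond there, while values in the tens of milliseconds point
at the log devices themselves.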

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub works in parallel?

2012-06-12 Thread Richard Elling
On Jun 11, 2012, at 6:05 AM, Jim Klimov wrote:

 2012-06-11 5:37, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Kalle Anka
 
 Assume we have 100 disks in one zpool. Assume it takes 5 hours to scrub
 one
 disk. If I scrub the zpool, how long time will it take?
 
 Will it scrub one disk at a time, so it will take 500 hours, i.e. in
 sequence, just
 serial? Or is it possible to run the scrub in parallel, so it takes 5h no
 matter
 how many disks?
 
 It will be approximately parallel, because it's actually scrubbing only the
 used blocks, and the order it scrubs in will be approximately the order they
 were written, which was intentionally parallel.
 
 What the other posters said, plus: 100 disks is quite a lot
 of contention on the bus(es), so even if it is all parallel,
 the bus and CPU bottlenecks would raise the scrubbing time
 somewhat above the single-disk scrub time.

In general, this is not true for HDDs or modern CPUs. Modern systems
are overprovisioned for bandwidth. In fact, bandwidth has been a poor
design point for storage for a long time. Dave Patterson has some 
interesting observations on this, now 8 years dated.
http://www.ll.mit.edu/HPEC/agendas/proc04/invited/patterson_keynote.pdf

SSDs tend to be a different story, and there is some interesting work being
done in this area, both on the systems side as well as the SSD side. This is
where the fun work is progressing :-)
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
















___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Richard Elling
On Jun 6, 2012, at 12:48 AM, Sašo Kiselkov wrote:

 So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached
 to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two
 LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm
 occasionally seeing a storm of xcalls on one of the 32 VCPUs (10
 xcalls a second).

That isn't much of a storm, I've seen > 1M xcalls in some cases...

 The machine is pretty much idle, only receiving a
 bunch of multicast video streams and dumping them to the drives (at a
 rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of
 xcalls that completely eat one of the CPUs, so the mpstat line for the
 CPU looks like:
 
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 310   0 102191 1000000 00 100
 0   0
 
 100% busy in the system processing cross-calls. When I tried dtracing
 this issue, I found that this is the most likely culprit:
 
 dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
   unix`xc_call+0x46
   unix`hat_tlb_inval+0x283
   unix`x86pte_inval+0xaa
   unix`hat_pte_unmap+0xed
   unix`hat_unload_callback+0x193
   unix`hat_unload+0x41
   unix`segkmem_free_vn+0x6f
   unix`segkmem_zio_free+0x27
   genunix`vmem_xfree+0x104
   genunix`vmem_free+0x29
   genunix`kmem_slab_destroy+0x87
   genunix`kmem_slab_free+0x2bb
   genunix`kmem_magazine_destroy+0x39a
   genunix`kmem_depot_ws_reap+0x66
   genunix`taskq_thread+0x285
   unix`thread_start+0x8
 3221701
 
 This happens in the sched (pid 0) process. My fsstat one looks like this:
 
 # fsstat /content 1
  new  name  name  attr  attr  lookup  rddir  read   read  write  write
 file remov  chng   get   set     ops    ops   ops  bytes    ops  bytes
    0     0     0   664     0     952      0     0      0    664  38.0M /content
    0     0     0   658     0     935      0     0      0    656  38.6M /content
    0     0     0   660     0     946      0     0      0    659  37.8M /content
    0     0     0   677     0     969      0     0      0    676  38.5M /content
 
 What's even more puzzling is that this happens apparently entirely
 because of some factor other than userland, since I see no changes to
 CPU usage of processes in prstat(1M) when this xcall storm happens, only
 an increase of loadavg of +1.00 (the busy CPU).

What exactly is the workload doing?

Local I/O, iSCSI, NFS, or CIFS?

 I Googled and found that
 http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html
 seems to have been an issue identical to mine, however, it remains
 unresolved at that time and it worries me about putting this kind of
 machine into production use.
 
 Could some ZFS guru please tell me what's going on in segkmem_zio_free?

It is freeing memory.

 When I disable the writers to the /content filesystem, this issue goes
 away, so it has obviously something to do with disk IO. Thanks!

Not directly related to disk I/O bandwidth. Can be directly related to other
use, such as deletions -- something that causes frees.

Depending on the cause, there can be some tuning that applies for large
memory machines, where large is >= 96 GB.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

