Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-22 Thread Arne Jansen
On 20.10.2012 22:24, Tim Cook wrote:
 
 
 On Sat, Oct 20, 2012 at 2:54 AM, Arne Jansen sensi...@gmx.net wrote:
 
 On 10/20/2012 01:10 AM, Tim Cook wrote:
 
 
  On Fri, Oct 19, 2012 at 3:46 PM, Arne Jansen sensi...@gmx.net wrote:
 
  On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
   On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen sensi...@gmx.net wrote:
  
   We have finished a beta version of the feature. A webrev for 
 it
   can be found here:
  
   http://cr.illumos.org/~webrev/sensille/fits-send/
  
   It adds a command 'zfs fits-send'. The resulting streams can
   currently only be received on btrfs, but more receivers will
   follow.
   It would be great if anyone interested could give it some 
 testing
   and/or review. If there are no objections, I'll send a formal
   webrev soon.
  
  
  
   Please don't bother changing libzfs (and proliferating the 
 copypasta
   there) -- do it like lzc_send().
  
 
  ok. It would be easier though if zfs_send would also already use the
  new style. Is it in the pipeline already?
 
   Likewise, zfs_ioc_fits_send should use the new-style API.  See the
   comment at the beginning of zfs_ioctl.c.
  
   I'm not a fan of the name FITS but I suppose somebody else 
 already
   named the format.  If we are going to follow someone else's format
   though, it at least needs to be well-documented.  Where can we
  find the
   documentation?
  
   FYI, #1 google hit for FITS:  http://en.wikipedia.org/wiki/FITS
   #3 hit:  http://code.google.com/p/fits/
  
   Both have to do with file formats.  The entire first page of 
 google
   results for FITS format and FITS file format are related to 
 these
   two formats.  FITS btrfs didn't return anything specific to the 
 file
   format, either.
 
  It's not too late to change it, but I have a hard time coming up 
 with
  some better name. Also, the format is still very new and I'm sure 
 it'll
  need some adjustments.
 
  -arne
 
  
   --matt
 
 
 
  I'm sure we can come up with something.  Are you planning on this being
  solely for ZFS, or a larger architecture for replication both directions
  in the future?
 
 We have senders for zfs and btrfs. The planned receiver will be mostly
 filesystem agnostic and can work on a much broader range. It basically
 only needs to know how to create snapshots and where to store some
 metadata.
 It would be great if more filesystems would join on the sending side,
 but I have no involvement there.
 
 I see no basic problem in choosing a name that's already in use.
 Especially with file extensions, most will already be taken. How about
 something with 'portable' and 'backup', like pib or pibs? 'i' for
 incremental.
 
 -Arne
 
 
 Re-using names generally isn't a big deal, but in this case the existing name
 is a technology that's extremely similar to what you're doing - which WILL cause
 a ton of confusion in the userbase, and make troubleshooting far more difficult
 when searching google/etc looking for links to documents that are applicable.
 
 Maybe something like far - filesystem agnostic replication?   

I like that one. It has a nice connotation to 'remote'. So 'far' it be.
Thanks!

-Arne

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-22 Thread Arne Jansen
On 22.10.2012 06:32, Matthew Ahrens wrote:
 On Sat, Oct 20, 2012 at 1:24 PM, Tim Cook t...@cook.ms wrote:
 
 
 
 On Sat, Oct 20, 2012 at 2:54 AM, Arne Jansen sensi...@gmx.net wrote:
 
 On 10/20/2012 01:10 AM, Tim Cook wrote:
 
 
  On Fri, Oct 19, 2012 at 3:46 PM, Arne Jansen sensi...@gmx.net wrote:
 
  On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
   On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen sensi...@gmx.net wrote:
  
   We have finished a beta version of the feature. A webrev 
 for it
   can be found here:
  
   http://cr.illumos.org/~webrev/sensille/fits-send/
  
   It adds a command 'zfs fits-send'. The resulting streams 
 can
   currently only be received on btrfs, but more receivers 
 will
   follow.
   It would be great if anyone interested could give it some
 testing
   and/or review. If there are no objections, I'll send a 
 formal
   webrev soon.
  
  
  
   Please don't bother changing libzfs (and proliferating the 
 copypasta
   there) -- do it like lzc_send().
  
 
  ok. It would be easier though if zfs_send would also already 
 use the
  new style. Is it in the pipeline already?
 
   Likewise, zfs_ioc_fits_send should use the new-style API.  
 See the
   comment at the beginning of zfs_ioctl.c.
  
   I'm not a fan of the name FITS but I suppose somebody else 
 already
   named the format.  If we are going to follow someone else's 
 format
   though, it at least needs to be well-documented.  Where can we
  find the
   documentation?
  
   FYI, #1 google hit for FITS:  
 http://en.wikipedia.org/wiki/FITS
   #3 hit:  http://code.google.com/p/fits/
  
   Both have to do with file formats.  The entire first page of 
 google
   results for FITS format and FITS file format are related 
 to
 these
   two formats.  FITS btrfs didn't return anything specific to
 the file
   format, either.
 
  It's not too late to change it, but I have a hard time coming 
 up with
  some better name. Also, the format is still very new and I'm 
 sure
 it'll
  need some adjustments.
 
  -arne
 
  
   --matt
 
 
 
  I'm sure we can come up with something.  Are you planning on this 
 being
  solely for ZFS, or a larger architecture for replication both 
 directions
  in the future?
 
 We have senders for zfs and btrfs. The planned receiver will be mostly
 filesystem agnostic and can work on a much broader range. It basically
 only needs to know how to create snapshots and where to store some
 metadata.
 It would be great if more filesystems would join on the sending side,
 but I have no involvement there.
 
 I see no basic problem in choosing a name that's already in use.
 Especially with file extensions, most will already be taken. How about
 something with 'portable' and 'backup', like pib or pibs? 'i' for
 incremental.
 
 -Arne
 
 
 Re-using names generally isn't a big deal, but in this case the existing
 name is a technology that's extremely similar to what you're doing - which
 WILL cause a ton of confusion in the userbase, and make troubleshooting 
 far
 more difficult when searching google/etc looking for links to documents 
 that
 are applicable.  
 
 Maybe something like far - filesystem agnostic replication?   
 
 
 All else being equal, I like this name (FAR).  It ends in AR like several
 other archive formats (TAR, WAR, JAR).  Plus not a lot of false positives when
 googling around for it.   
 
 However, if compatibility with the existing format is an explicit goal, we
 should use the same name, and the btrfs authors may be averse to changing the 
 name.

There's really nothing to keep. In the btrfs world, like in the zfs world, the
stream has no special name, it's just a 'btrfs send stream', like the 'zfs send
stream'. The necessity for a name only arises from the wish to build a bridge
between the worlds.
The author

Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-20 Thread Arne Jansen
On 10/20/2012 01:10 AM, Tim Cook wrote:
 
 
 On Fri, Oct 19, 2012 at 3:46 PM, Arne Jansen sensi...@gmx.net wrote:
 
 On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
  On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen sensi...@gmx.net wrote:
 
  We have finished a beta version of the feature. A webrev for it
  can be found here:
 
  http://cr.illumos.org/~webrev/sensille/fits-send/
 
  It adds a command 'zfs fits-send'. The resulting streams can
  currently only be received on btrfs, but more receivers will
  follow.
  It would be great if anyone interested could give it some testing
  and/or review. If there are no objections, I'll send a formal
  webrev soon.
 
 
 
  Please don't bother changing libzfs (and proliferating the copypasta
  there) -- do it like lzc_send().
 
 
 ok. It would be easier though if zfs_send would also already use the
 new style. Is it in the pipeline already?
 
  Likewise, zfs_ioc_fits_send should use the new-style API.  See the
  comment at the beginning of zfs_ioctl.c.
 
  I'm not a fan of the name FITS but I suppose somebody else already
  named the format.  If we are going to follow someone else's format
  though, it at least needs to be well-documented.  Where can we
 find the
  documentation?
 
  FYI, #1 google hit for FITS:  http://en.wikipedia.org/wiki/FITS
  #3 hit:  http://code.google.com/p/fits/
 
  Both have to do with file formats.  The entire first page of google
  results for FITS format and FITS file format are related to these
  two formats.  FITS btrfs didn't return anything specific to the file
  format, either.
 
 It's not too late to change it, but I have a hard time coming up with
 some better name. Also, the format is still very new and I'm sure it'll
 need some adjustments.
 
 -arne
 
 
  --matt
 
 
 
 I'm sure we can come up with something.  Are you planning on this being
 solely for ZFS, or a larger architecture for replication both directions
 in the future?

We have senders for zfs and btrfs. The planned receiver will be mostly
filesystem agnostic and can work on a much broader range. It basically
only needs to know how to create snapshots and where to store some
metadata.
It would be great if more filesystems would join on the sending side,
but I have no involvement there.
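To make that receiver-side contract concrete, here is a hedged sketch in C (all
names are hypothetical, and this is not code from the webrev): the receiver core
would drive the target filesystem through a tiny backend interface, and could
simply skip the snapshot hook on filesystems that don't support it.

    /*
     * Hypothetical sketch only -- not from the webrev.  A filesystem-agnostic
     * receiver needs just two hooks from the target filesystem: create a
     * snapshot, and persist a little metadata (e.g. which source snapshot a
     * received tree corresponds to).
     */
    struct recv_backend {
            /* snapshot the received tree; may be unsupported on plain filesystems */
            int (*create_snapshot)(const char *path, const char *snap_name);

            /* remember key/value metadata needed for later incremental receives */
            int (*store_meta)(const char *path, const char *key, const char *value);
    };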

I see no basic problem in choosing a name that's already in use.
Especially with file extensions, most will already be taken. How about
something with 'portable' and 'backup', like pib or pibs? 'i' for
incremental.

-Arne


 
 --Tim
  
 



Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-20 Thread Arne Jansen
On 10/20/2012 01:21 AM, Matthew Ahrens wrote:
 On Fri, Oct 19, 2012 at 1:46 PM, Arne Jansen sensi...@gmx.net wrote:
 
 On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
  Please don't bother changing libzfs (and proliferating the copypasta
  there) -- do it like lzc_send().
 
 
 ok. It would be easier though if zfs_send would also already use the
 new style. Is it in the pipeline already?
 
  Likewise, zfs_ioc_fits_send should use the new-style API.  See the
  comment at the beginning of zfs_ioctl.c.
 
 
 I'm saying to use lzc_send() as an example, rather than zfs_send().
  lzc_send() already uses the new style.  I don't see how your job would
 be made easier by converting zfs_send().

Yeah, but the zfs util still uses the old version.
 
 It would be nice to convert ZFS_IOC_SEND to the new IOCTL format
 someday, but I don't think that the complexities of zfs_send() would be
 appropriate for libzfs_core.  Programmatic consumers typically know
 exactly what snapshots they want sent and would prefer the clean error
 handling of lzc_send().

What I meant was: if you want the full-blown zfs send functionality with
its ton of options, it would be much easier to reuse the existing logic
and only call *_send_fits instead of *_send when requested.
If you're content with just the -i option I've currently implemented,
it's certainly easy to convert. For my part, I have mostly programmatic
consumers.

-Arne

 
 --matt



Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-19 Thread Arne Jansen
On 19.10.2012 10:47, Joerg Schilling wrote:
 Arne Jansen sensi...@gmx.net wrote:
 
 On 10/18/2012 10:19 PM, Andrew Gabriel wrote:
 Arne Jansen wrote:
 We have finished a beta version of the feature.

 What does FITS stand for?

 Filesystem Incremental Transport Stream
 (or Filesystem Independent Transport Stream)
 
 Is this an attempt to create a competition for TAR?

Not really. We'd have preferred tar if it had been powerful enough.
It's more an alternative to rsync for incremental updates. I really
like the send/receive feature and want to make it available for cross-
platform syncs.

Arne

 
 Jörg
 



Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-19 Thread Arne Jansen
On 19.10.2012 11:16, Irek Szczesniak wrote:
 On Wed, Oct 17, 2012 at 2:29 PM, Arne Jansen sensi...@gmx.net wrote:
 We have finished a beta version of the feature. A webrev for it
 can be found here:

 http://cr.illumos.org/~webrev/sensille/fits-send/

 It adds a command 'zfs fits-send'. The resulting streams can
 currently only be received on btrfs, but more receivers will
 follow.
 It would be great if anyone interested could give it some testing
 and/or review. If there are no objections, I'll send a formal
 webrev soon.
 
 Why are you trying to reinvent the wheel? AFAIK some tar versions and
 AT&T AST pax support deltas based on a standard (I'll have to dig out
 the exact specification, but from looking at it you did double work).
 

I haven't done the research myself, but the conclusion was that pax would
have needed significant extensions; I don't have the details. If
you dig out a format already in use that supports everything we need
(like sharing data between files, needed for btrfs reflinks), it should
be easy to change the format. Packing the data into a specific format
is not an essential part of the work and can be changed with limited
effort.

-Arne

 Irek



Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-19 Thread Arne Jansen
On 19.10.2012 12:17, Joerg Schilling wrote:
 Arne Jansen sensi...@gmx.net wrote:
 
 Is this an attempt to create a competition for TAR?

 Not really. We'd have preferred tar if it had been powerful enough.
 It's more an alternative to rsync for incremental updates. I really
 like the send/receive feature and want to make it available for cross-
 platform syncs.
 
 TAR with the star extensions that are also implemented by many other recent 
 TAR 
 programs should do, what are you missing?

As I said, I haven't done the research myself, but operations that come to mind
include:
 - partial updates of files
 - sparse files
 - punch hole
 - truncate
 - rename
 - referencing parts of other files as the data to write (reflinks)
 - create snapshot

Does star support these operations? Are they part of any standard?

Also, are chmod/chown/set atime/mtime possible on existing files?

I'd be happy to find a well maintained receiver/format to send the
data in, this way I don't need to care about a cross-platform
receiver :)
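To illustrate what some of these operations could look like on a receiving side,
here is a hedged sketch in C (Linux-specific, assuming fallocate(2) with
FALLOC_FL_PUNCH_HOLE is available; it is not taken from any existing receiver):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <unistd.h>

    /* 'punch hole': deallocate a range inside the file, keep its size unchanged */
    static int apply_punch_hole(int fd, off_t off, off_t len)
    {
            return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                off, len);
    }

    /* 'truncate': shrink or extend the file to the requested size */
    static int apply_truncate(int fd, off_t size)
    {
            return ftruncate(fd, size);
    }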

-Arne

 
 Jörg
 



Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-19 Thread Arne Jansen
On 19.10.2012 13:53, Joerg Schilling wrote:
 Arne Jansen sensi...@gmx.net wrote:
 
 On 19.10.2012 12:17, Joerg Schilling wrote:
 Arne Jansen sensi...@gmx.net wrote:

 Is this an attempt to create a competition for TAR?

 Not really. We'd have preferred tar if it had been powerful enough.
 It's more an alternative to rsync for incremental updates. I really
 like the send/receive feature and want to make it available for cross-
 platform syncs.

 TAR with the star extensions that are also implemented by many other recent 
 TAR 
 programs should do, what are you missing?

 As I said, I haven't done the research myself, but operations that come to mind
  include:
  - partial updates of files
 
 How do you intend to detect this _after_ the original file was updated?

The patch works on the kernel side. It detects changes by efficiently comparing
zfs snapshots. If you change a block in a file, we'll send only the changed
block.
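A much simplified sketch of the general idea in C (not the actual code): every
block pointer records the transaction group (txg) in which it was born, so an
incremental pass can skip any subtree that is not newer than the base snapshot.

    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical, heavily reduced view of a block pointer */
    struct blkptr_lite {
            uint64_t birth_txg;     /* txg in which this block was written */
    };

    /*
     * When diffing against a base snapshot, a subtree only needs to be
     * visited if its root block was born after the base snapshot's txg;
     * otherwise nothing underneath it has changed.
     */
    static bool
    must_visit(const struct blkptr_lite *bp, uint64_t base_snap_txg)
    {
            return (bp->birth_txg > base_snap_txg);
    }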

 
  - sparse files
 
 supported in an efficient way by star
 
  - punch hole
 
 As this is a specific case of a sparse file, it could be added

While 'sparse file' meant the initial transport of a sparse file, 'punch hole'
is meant in the incremental context. Linux supports punching holes in files.

 
  - truncate
 
 see above
 
  - rename
 
 part of the incremental restore architecture from star, but needs a restore 
 symbol table on the receiving system

Here, the sending side determines renames.

 
  - referencing parts of other files as the data to write (reflinks)
 
 There is no user space  interface to detect this, why do you need it?

As above, it is detected in the kernel. In the case of btrfs, each data block
has back references to all its users. In zfs, the problem does not exist
in this form; nevertheless, a common format has to be able to transport
this.

 
  - create snapshot
 
 star supports incrementals. Or do you mean that a snapshot should be set up
 on the receiving side?

Yes, a snapshot on the receiving side. The aim is to exactly replicate a number of
snapshots, just as zfs send/receive does.

Alexander Block (the author of btrfs send/receive) commented in another
subthread and explained the motivation to move away from tar/pax.

-Arne

 
 Does star support these operations? Are they part of any standard?

 Also, are chmod/chown/set atime/mtime possible on existing files?
 
 star allows to call:
 
   star -x -xmeta
 
 to _only_ extract meta data from a normal tar archive, and it allows one to create
 a specific metadata-only archive via star -c -meta
 
 Jörg
 



Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-19 Thread Arne Jansen
On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
 On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen sensi...@gmx.net wrote:
 
 We have finished a beta version of the feature. A webrev for it
 can be found here:
 
 http://cr.illumos.org/~webrev/sensille/fits-send/
 
 It adds a command 'zfs fits-send'. The resulting streams can
 currently only be received on btrfs, but more receivers will
 follow.
 It would be great if anyone interested could give it some testing
 and/or review. If there are no objections, I'll send a formal
 webrev soon.
 
 
 
 Please don't bother changing libzfs (and proliferating the copypasta
 there) -- do it like lzc_send().
 

OK. It would be easier, though, if zfs_send also already used the
new style. Is it in the pipeline already?

 Likewise, zfs_ioc_fits_send should use the new-style API.  See the
 comment at the beginning of zfs_ioctl.c.
 
 I'm not a fan of the name FITS but I suppose somebody else already
 named the format.  If we are going to follow someone else's format
 though, it at least needs to be well-documented.  Where can we find the
 documentation?
 
 FYI, #1 google hit for FITS:  http://en.wikipedia.org/wiki/FITS
 #3 hit:  http://code.google.com/p/fits/
 
 Both have to do with file formats.  The entire first page of google
 results for FITS format and FITS file format are related to these
 two formats.  FITS btrfs didn't return anything specific to the file
 format, either.

It's not too late to change it, but I have a hard time coming up with
a better name. Also, the format is still very new and I'm sure it'll
need some adjustments.

-arne

 
 --matt
 
 


Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-18 Thread Arne Jansen
On 10/18/2012 10:19 PM, Andrew Gabriel wrote:
 Arne Jansen wrote:
 We have finished a beta version of the feature.
 
 What does FITS stand for?

Filesystem Incremental Transport Stream
(or Filesystem Independent Transport Stream)


Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-17 Thread Arne Jansen
We have finished a beta version of the feature. A webrev for it
can be found here:

http://cr.illumos.org/~webrev/sensille/fits-send/

It adds a command 'zfs fits-send'. The resulting streams can
currently only be received on btrfs, but more receivers will
follow.
It would be great if anyone interested could give it some testing
and/or review. If there are no objections, I'll send a formal
webrev soon.

Thanks,
Arne


On 10.10.2012 21:38, Arne Jansen wrote:
 Hi,
 
 We're currently working on a feature to send zfs streams in a portable
 format that can be received on any filesystem.
 It is a kernel patch that generates the stream directly from the kernel,
 analogous to what zfs send does.
 The stream format is based on the btrfs send format. The basic idea
 is to just send commands like mkdir, unlink, create, write, etc.
 For incremental sends it's the same idea.
 The receiving side is user mode only, so it's very easy to port it to
 any system. If the receiving side has the capability to create
 snapshots, those from the sending side will get recreated. If not,
 you still have the benefit of fast incremental transfers.
 
 My question is if there's any interest in this feature and if it had
 chances of getting accepted. Also, would it be acceptable to start with
 a working version and add performance optimizations later on?
 
 -Arne
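To give a feel for the command-stream idea described in the quoted mail above,
here is a hedged illustration in C -- this is NOT the actual FITS/btrfs send
wire format, only the general flavour of a command-based stream, with all names
hypothetical:

    #include <stdint.h>

    /* hypothetical command set -- for illustration only */
    enum stream_cmd {
            CMD_MKDIR,
            CMD_CREATE,
            CMD_WRITE,      /* offset + length + payload for one file range */
            CMD_UNLINK,
            CMD_RENAME,
            CMD_SNAPSHOT    /* "everything up to here forms snapshot X" */
    };

    /* hypothetical per-command header preceding each payload */
    struct stream_hdr {
            uint16_t cmd;   /* one of enum stream_cmd */
            uint32_t len;   /* length of the payload that follows */
    };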


[zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen
Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) blocks
until the txg is flushed. This means a write takes up to 30 seconds. During
this time, the nfs calls block, occupying all NFS server threads. With all
server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server effectively
down to zero.
It may be that the trigger for this behavior is around 95%. I managed to bring
the pool down to 95%, and now the writes get served continuously, as they should.

What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen

Hi Neil,

Neil Perrin wrote:

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to
achieve this.

If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is
30 seconds. I would expect much less, but finding room for the rest of the
txg data and metadata would also be a challenge.


I think this is not what we saw, for two reasons:
 a) we have a mirrored slog device. According to zpool iostat -v only 16MB
out of 4GB were in use.
 b) it didn't seem like the txg would have been closed early. Rather, it kept
approximately to the 30-second intervals.

Internally we came up with a different explanation, without any backing that
it might be correct: when the pool reaches 96%, zfs goes into a 'self defense'
mode. Instead of allocating a block from the ZIL, every write turns synchronous and
has to wait for the txg to finish naturally. The reasoning behind this might
be that even if a ZIL is available, there might not be enough space left to commit
the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above
96%. While this might be proper for small pools, on large pools 4% is still
some TB of free space, so there should be an upper limit of maybe 10GB on this
hidden reserve.
Also, this sudden switch of behavior is completely unexpected and at least
under-documented.



Most (maybe all?) file systems perform badly when out of space. I believe
we give a recommended free size, and I thought it was 90%.


In this situation, not only did writes suffer; as a side effect, reads also
came to a nearly complete halt.

--
Arne




Neil.

On 09/09/10 09:00, Arne Jansen wrote:

Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full
pool (96%). Here are the symptoms: over NFS, a find on the pool takes
a very long time, up to 30s (!) for each file. Locally, the performance
is quite normal.
What I found out so far: It seems that every nfs write (rfs3_write) 
blocks
until the txg is flushed. This means a write takes up to 30 seconds. 
During
this time, the nfs calls block, occupying all NFS server threads. With 
all

server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait
until the writes finish, bringing the performance of the server 
effectively

down to zero.
It may be that the trigger for this behavior is around 95%. I managed 
to bring
the pool down to 95%, now the writes get served continuously as it 
should be.


What is the explanation for this behaviour? Is it intentional and can the
threshold be tuned? I experienced this on Sol10 U8.

Thanks,
Arne


Re: [zfs-discuss] NFS performance near zero on a very full pool

2010-09-09 Thread Arne Jansen

Richard Elling wrote:

On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:


Hi Neil,

Neil Perrin wrote:

NFS often demands its transactions are stable before returning.
This forces ZFS to do the system call synchronously. Usually the
ZIL (code) allocates and writes a new block in the intent log chain to achieve
this.
If ever it fails to allocate a block (of the size requested) it is forced
to close the txg containing the system call. Yes, this can be extremely
slow, but there is no other option for the ZIL. I'm surprised the wait is 30
seconds. I would expect much less, but finding room for the rest of the txg
data and metadata would also be a challenge.

I think this is not what we saw, for two reasons:
a) we have a mirrored slog device. According to zpool iostat -v only 16MB
   out of 4GB were in use.
b) it didn't seem like the txg would have been closed early. Rather, it kept
   approximately to the 30-second intervals.

Internally we came up with a different explanation, without any backing that
it might be correct: when the pool reaches 96%, zfs goes into a 'self defense'
mode. Instead of allocating a block from the ZIL, every write turns synchronous and
has to wait for the txg to finish naturally. The reasoning behind this might
be that even if a ZIL is available, there might not be enough space left to commit
the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above
96%. While this might be proper for small pools, on large pools 4% is still
some TB of free space, so there should be an upper limit of maybe 10GB on this
hidden reserve.


I do not believe this is correct.  At 96% the first-fit algorithm changes to
best-fit and ganging can be expected. This has nothing to do with the ZIL.  There is
already a reserve set aside for metadata and the ZIL so that you can remove
files when the file system is 100% full.  This reserve is 32 MB or 1/64 of the
pool size.


Maybe it is some side-effect of this change of allocation scheme. But I'm very
sure about what I saw. The change was drastic and abrupt. I had a dtrace script
running that measured the time for rfs3_write to complete. With the pool at 96%
I saw a burst of writes every 30 seconds, with completion times of up to 30s.
With the pool below 96%, I saw a continuous stream of writes with completion times
of mostly a few microseconds.




In this situation, not only writes suffered, but as a side effect reads also
came to a nearly complete halt.


If you have atime=on, then reads create writes.


atime is off. The impact on reads/lookups/getattr came, imho, from all server
threads being occupied by blocking writes for a prolonged time.

I'll try to reproduce this on a test machine.

--
Arne



Re: [zfs-discuss] what is zfs doing during a log resilver?

2010-09-05 Thread Arne Jansen

Giovanni Tirloni wrote:



On Thu, Sep 2, 2010 at 10:18 AM, Jeff Bacon ba...@walleyesoftware.com wrote:


So, when you add a log device to a pool, it initiates a resilver.

What is it actually doing, though? Isn't the slog a copy of the
in-memory intent log? Wouldn't it just simply replicate the data that's
in the other log, checked against what's in RAM? And presumably there
isn't that much data in the slog so there isn't that much to check?

Or is it just doing a generic resilver for the sake of argument because
you changed something?


Good question. Here it takes little over 1 hour to resilver a 32GB SSD 
in a mirror. I've always wondered what exactly it was doing since it was 
supposed to be 30 seconds worth of data. It also generates lots of 
checksum errors.


Here it takes more than 2 days to resilver a failed slog-SSD. I'd also
expect it to finish in a few seconds... It seems it resilvers the whole pool,
35T worth of data on 22 spindles (RAID-Z2).

We don't get any errors during resilver.

--
Arne



Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision

2010-07-23 Thread Arne Jansen

Edward Ned Harvey wrote:

From: Robert Milkowski [mailto:mi...@task.gda.pl]

[In raidz] The issue is that each zfs filesystem block is basically spread across
n-1 devices.
So every time you want to read back a single fs block you need to wait
for all n-1 devices to provide you with a part of it - and keep in mind
in zfs you can't get a partial block even if that's what you are asking
for as zfs has to check checksum of entire fs block.


Can anyone else confirm or deny the correctness of this statement?

If you read a small file from a raidz volume, do you have to wait for every
single disk to return a small chunk of the blocksize?  I know this is true
for large files which require more than one block, obviously, but even a
small file gets spread out across multiple disks?

This may be the way it's currently implemented, but it's not a mathematical
requirement.  It is possible, if desired, to implement raid parity and still
allow small files to be written entirely on a single disk, without losing
redundancy.  Thus providing the redundancy, the large file performance,
(both of which are already present in raidz), and also optimizing small file
random operations, which may not already be optimized in raidz.


As I understand it, that's the whole point of raidz. Each block is its own
stripe. If necessary, the block gets broken down into 512-byte chunks to spread
it as wide as possible. Each block gets its own parity added. So if the array
is too wide for the block to be spread to all disks, you also lose space, because
the stripe is not full and parity gets added to that small stripe. That means
if you only write 512-byte blocks, each write writes 3 blocks to disk, so the
net capacity goes down to one third, regardless of how many disks you have in your
raid group.
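To spell out the arithmetic (assuming 512-byte sectors): with raidz1 a single
512-byte block needs one data sector plus one parity sector, i.e. 2 sectors on
disk (net capacity 1/2); with raidz2 it needs one data plus two parity sectors,
i.e. 3 sectors (net capacity 1/3). Either way the overhead is independent of how
many disks are in the group, because a one-sector block never spans more columns.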



Re: [zfs-discuss] 1tb SATA drives

2010-07-16 Thread Arne Jansen

Jordan McQuown wrote:
I’m curious to know what other people are running for HD’s in white box 
systems? I’m currently looking at Seagate Barracuda’s and Hitachi 
Deskstars. I’m looking at the 1tb models. These will be attached to an 
LSI expander in a sc847e2 chassis driven by an LSI 9211-8i HBA. This 
system will be used as a large storage array for backups and archiving.


I wouldn't recommend using desktop drives in a server RAID. They can't
handle the vibration present in a server well. I'd recommend
at least the Seagate Constellation or the Hitachi Ultrastar, though I
haven't tested the Deskstar myself.

--Arne

 
Thanks,

Jordan
 







Re: [zfs-discuss] Hashing files rapidly on ZFS

2010-07-06 Thread Arne Jansen
Daniel Carosone wrote:
 Something similar would be useful, and much more readily achievable,
 from ZFS from such an application, and many others.  Rather than a way
 to compare reliably between two files for identity, I'ld liek a way to
 compare identity of a single file between two points in time.  If my
 application can tell quickly that the file content is unaltered since
 last time I saw the file, I can avoid rehashing the content and use a
 stored value. If I can achieve this result for a whole directory
 tree, even better.

This would be great for any kind of archiving software. Aren't zfs checksums
already in a position to solve this? If a file changes, its dnode's checksum changes,
then the checksum of the directory it is in, and so forth all the way up to the
uberblock.
There may be ways a checksum changes without a real change in the file's content,
but the other way round should hold: if the checksum didn't change, the file
didn't change.
So the only missing link is a way to determine zfs's checksum for a
file/directory/dataset. Am I missing something here? Of course atime updates
should be turned off; otherwise the checksum will get changed by the archiving
agent.
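As a toy demonstration of that propagation argument (this is not ZFS code and
uses a deliberately trivial hash; ZFS computes real checksums such as fletcher4
or sha256 over the actual block contents):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* trivial 64-bit FNV-1a hash -- a stand-in for a real checksum */
    static uint64_t
    cksum(const void *data, size_t len)
    {
            const uint8_t *p = data;
            uint64_t h = 14695981039346656037ULL;
            while (len--) {
                    h ^= *p++;
                    h *= 1099511628211ULL;
            }
            return (h);
    }

    int
    main(void)
    {
            const char *a = "contents of file A";
            const char *b = "contents of file B";
            /* a "directory" checksum computed over its children's checksums */
            uint64_t dir[2] = { cksum(a, strlen(a)), cksum(b, strlen(b)) };
            uint64_t root_before = cksum(dir, sizeof (dir));

            /* change file B: its checksum changes, and so does the directory's */
            b = "contents of file B, modified";
            dir[1] = cksum(b, strlen(b));
            uint64_t root_after = cksum(dir, sizeof (dir));

            printf("root changed: %s\n", root_before != root_after ? "yes" : "no");
            return (0);
    }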


Re: [zfs-discuss] What happens when unmirrored ZIL log device is removed ungracefully

2010-06-29 Thread Arne Jansen
Edward Ned Harvey wrote:
 Due to recent experiences, and discussion on this list, my colleague and
 I performed some tests:
 
 Using solaris 10, fully upgraded.  (zpool 15 is latest, which does not
 have log device removal that was introduced in zpool 19)  In any way
 possible, you lose an unmirrored log device, and the OS will crash, and
 the whole zpool is permanently gone, even after reboots.
 

I'm a bit confused. I tried hard, but haven't been able to reproduce this
using Sol10U8. I have a mirrored slog device. While putting it
under load doing synchronous file creations, we pulled the power cords
and unplugged the slog devices. After powering on zfs imported the pool,
but prompted to acknowledge the missing slog devices with zpool clear.
After that the pool was accessible again. That's exactly how it should be.

What am I doing wrong here? The system is on a different pool using different
disks.

One peculiarity I noted though: when pulling both slog devices from the running
machine, zpool status reports 1 file error. In my understanding this should
not happen as the file data is written from memory and not from the contents
of the zil. It seems the reported write error from the slog device somehow
led to a corrupted file.

Thanks,
Arne


Re: [zfs-discuss] OCZ Vertex 2 Pro performance numbers

2010-06-26 Thread Arne Jansen

Geoff Nordli wrote:


Is this the one
(http://www.ocztechnology.com/products/solid-state-drives/2-5--sata-ii/maxim
um-performance-enterprise-solid-state-drives/ocz-vertex-2-pro-series-sata-ii
-2-5--ssd-.html) with the built in supercap? 



Yes.

Geoff 




[zfs-discuss] OCZ Vertex 2 Pro performance numbers

2010-06-25 Thread Arne Jansen
Now the test for the Vertex 2 Pro. This was fun.
For more explanation please see the thread Crucial RealSSD C300 and cache
flush?
This time I made sure the device is attached via 3GBit SATA. This is also
only a short test. I'll retest after some weeks of usage.

cache enabled, 32 buffers, 64k blocks
linear write, random data: 96 MB/s
linear read, random data: 206 MB/s
linear write, zero data: 234 MB/s
linear read, zero data: 255 MB/s
random write, random data: 84 MB/s
random read, random data: 180 MB/s
random write, zero data: 224 MB/s
random read, zero data: 190 MB/s

cache enabled, 32 buffers, 4k blocks
linear write, random data: 93 MB/s
linear read, random data: 138 MB/s
linear write, zero data: 113 MB/s
linear read, zero data: 141 MB/s
random write, random data: 41 MB/s (10300 ops/s)
random read, random data: 76 MB/s (19000 ops/s)
random write, zero data: 54 MB/s (13800 ops/s)
random read, zero data: 91 MB/s (22800 ops/s)


cache enabled, 1 buffer, 4k blocks
linear write, random data: 62 MB/s (15700 ops/s)
linear read, random data: 32 MB/s (8000 ops/s)
linear write, zero data: 64 MB/s (16100 ops/s)
linear read, zero data: 45 MB/s (11300 ops/s)
random write, random data: 14 MB/s (3400 ops/s)
random read, random data: 22 MB/s (5600 ops/s)
random write, zero data: 19 MB/s (4500 ops/s)
random read, zero data: 21 MB/s (5100 ops/s)

cache enabled, 1 buffer, 4k blocks, with cache flushes:
linear write, random data, flush after every write: 5700 ops/s
linear write, zero data, flush after every write: 5700 ops/s
linear write, random data, flush after every 4th write: 8500 ops/s
linear write, zero data, flush after every 4th write: 8500 ops/s

Some remarks:

The random op numbers have to be read with care:
 - reading occurs in the same order as the writing before
 - the ops are not aligned to any specific boundary

The device also passed the write-loss test: after 5 repeats, no
data was lost.

It doesn't make any difference whether the cache is enabled or disabled, so
it might be worth tuning zfs to not issue cache flushes.

Conclusion: This device will make an excellent slog device. I'll order
them today ;)

--Arne


Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Arne Jansen
Hi,

Roy Sigurd Karlsbakk wrote:
 Crucial RealSSD C300 has been released and showing good numbers for use as 
 Zil and L2ARC. Does anyone know if this unit flushes its cache on request, as 
 opposed to Intel units etc?
 

I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and did
some quick testing. Here are the numbers first, some explanation follows below:

cache enabled, 32 buffers:
Linear read, 64k blocks: 134 MB/s
random read, 64k blocks: 134 MB/s
linear read, 4k blocks: 87 MB/s
random read, 4k blocks: 87 MB/s
linear write, 64k blocks: 107 MB/s
random write, 64k blocks: 110 MB/s
linear write, 4k blocks: 76 MB/s
random write, 4k blocks: 32 MB/s

cache enabled, 1 buffer:
linear write, 4k blocks: 51 MB/s (12800 ops/s)
random write, 4k blocks: 7 MB/s (1750 ops/s)
linear write, 64k blocks: 106 MB/s (1610 ops/s)
random write, 64k blocks: 59 MB/s (920 ops/s)

cache disabled, 1 buffer:
linear write, 4k blocks: 4.2 MB/s (1050 ops/s)
random write, 4k blocks: 3.9 MB/s (980 ops/s)
linear write, 64k blocks: 40 MB/s (650 ops/s)
random write, 64k blocks: 40 MB/s (650 ops/s)

cache disabled, 32 buffers:
linear write, 4k blocks: 4.5 MB/s, 1120 ops/s
random write, 4k blocks: 4.2 MB/s, 1050 ops/s
linear write, 64k blocks: 43 MB/s, 680 ops/s
random write, 64k blocks: 44 MB/s, 690 ops/s

cache enabled, 1 buffer, with cache flushes
linear write, 4k blocks, flush after every write: 1.5 MB/s, 385 writes/s
linear write, 4k blocks, flush after every 4th write: 4.2 MB/s, 1120 writes/s


The numbers are rough numbers read quickly from iostat, so please don't
multiply block size by ops and compare with the bandwidth given ;)
The test operates directly on top of LDI, just like ZFS.
 - nk blocks means the size of each read/write given to the device driver
 - n buffers means the number of buffers I keep in flight. This is to keep
   the command queue of the device busy
 - cache flush means a synchronous ioctl DKIOCFLUSHWRITECACHE
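For reference, a rough user-space approximation of the "flush after every write"
case (hedged: the real test runs in-kernel on top of LDI, and the device path
below is only a placeholder -- point it at an unused disk):

    #include <sys/types.h>
    #include <sys/dkio.h>
    #include <fcntl.h>
    #include <stropts.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[4096];
            /* placeholder raw device path -- use an unused disk/slice only */
            int fd = open("/dev/rdsk/c0t0d0s0", O_WRONLY);
            if (fd == -1)
                    return (1);
            memset(buf, 0xab, sizeof (buf));
            for (off_t off = 0; off < 1024 * 4096; off += sizeof (buf)) {
                    if (pwrite(fd, buf, sizeof (buf), off) != (ssize_t)sizeof (buf))
                            break;
                    /* synchronous cache flush after every single write */
                    if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) == -1)
                            break;
            }
            (void) close(fd);
            return (0);
    }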

These numbers contain a few surprises (at least for me). The biggest surprise
is that with cache disabled one cannot get good data rates with small blocks,
even if one keeps the command queue filled. This is completely different from
what I've seen from hard drives.
Also, the IOPS with cache flushes is quite low: 385 is not much better than
a 15k hdd, while the latter scales better. On the other hand, from the large
drop in performance when using flushes one could infer that they indeed flush
properly, but I haven't built a test setup for that yet.

Conclusion: From the measurements I'd infer the device makes a good L2ARC,
but for a slog device the latency is too high and it doesn't scale well.

I'll do similar tests on a x-25 and ocz vertex 2 pro as soon as they arrive.

If there are numbers you are missing please tell me, I'll measure them if
possible. Also please ask if there are questions regarding the test setup.

--
Arne


Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Arne Jansen
Arne Jansen wrote:
 Hi,
 
 Roy Sigurd Karlsbakk wrote:
 Crucial RealSSD C300 has been released and showing good numbers for use as 
 Zil and L2ARC. Does anyone know if this unit flushes its cache on request, 
 as opposed to Intel units etc?

 
 I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and did
 some quick testing. Here are the numbers first, some explanation follows below:

After taemun alerted me that the linear read/write numbers are too low, I found a
bottleneck: the controller decided to connect the SSD with only 1.5GBit. I have
to check if we can jumper it to at least 3GBit. To connect it with 6GBit we need
some new cables, so this might take some time.
The main purpose of this test was to evaluate the SSD with respect to usage as
a slog device and I think the connection speed doesn't affect this. Nevertheless
I'll repeat the tests as soon as we've solved the issues.

Sorry.

--Arne

 
 cache enabled, 32 buffers:
 Linear read, 64k blocks: 134 MB/s
 random read, 64k blocks: 134 MB/s
 linear read, 4k blocks: 87 MB/s
 random read, 4k blocks: 87 MB/s
 linear write, 64k blocks: 107 MB/s
 random write, 64k blocks: 110 MB/s
 linear write, 4k blocks: 76 MB/s
 random write, 4k blocks: 32 MB/s
 
 cache enabled, 1 buffer:
 linear write, 4k blocks: 51 MB/s (12800 ops/s)
 random write, 4k blocks: 7 MB/s (1750 ops/s)
 linear write, 64k blocks: 106 MB/s (1610 ops/s)
 random write, 64k blocks: 59 MB/s (920 ops/s)
 
 cache disabled, 1 buffer:
 linear write, 4k blocks: 4.2 MB/s (1050 ops/s)
 random write, 4k blocks: 3.9 MB/s (980 ops/s)
 linear write, 64k blocks: 40 MB/s (650 ops/s)
 random write, 64k blocks: 40 MB/s (650 ops/s)
 
 cache disabled, 32 buffers:
 linear write, 4k blocks: 4.5 MB/s, 1120 ops/s
 random write, 4k blocks: 4.2 MB/s, 1050 ops/s
 linear write, 64k blocks: 43 MB/s, 680 ops/s
 random write, 64k blocks: 44 MB/s, 690 ops/s
 
 cache enabled, 1 buffer, with cache flushes
 linear write, 4k blocks, flush after every write: 1.5 MB/s, 385 writes/s
 linear write, 4k blocks, flush after every 4th write: 4.2 MB/s, 1120 writes/s
 
 
 The numbers are rough numbers read quickly from iostat, so please don't
 multiply block size by ops and compare with the bandwidth given ;)
 The test operates directly on top of LDI, just like ZFS.
  - nk blocks means the size of each read/write given to the device driver
  - n buffers means the number of buffers I keep in flight. This is to keep
the command queue of the device busy
  - cache flush means a synchronous ioctl DKIOCFLUSHWRITECACHE
 
 These numbers contain a few surprises (at least for me). The biggest surprise
 is that with cache disabled one cannot get good data rates with small blocks,
 even if one keeps the command queue filled. This is completely different from
 what I've seen from hard drives.
 Also the IOPS with cache flushes is quite low, 385 is not much better than
 a 15k hdd, while the latter scales better. On the other hand, from the large
 drop in performance when using flushes one could infer that they indeed flush
 properly, but I haven't built a test setup for that yet.
 
 Conclusion: From the measurements I'd infer the device makes a good L2ARC,
 but for a slog device the latency is too high and it doesn't scale well.
 
 I'll do similar tests on a x-25 and ocz vertex 2 pro as soon as they arrive.
 
 If there are numbers you are missing please tell me, I'll measure them if
 possible. Also please ask if there are questions regarding the test setup.
 
 --
 Arne


Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread Arne Jansen
Arne Jansen wrote:
 Hi,
 
 Roy Sigurd Karlsbakk wrote:
 Crucial RealSSD C300 has been released and showing good numbers for use as 
 Zil and L2ARC. Does anyone know if this unit flushes its cache on request, 
 as opposed to Intel units etc?

 
 Also the IOPS with cache flushes is quite low, 385 is not much better than
 a 15k hdd, while the latter scales better. On the other hand, from the large
 drop in performance when using flushes one could infer that they indeed flush
 properly, but I haven't built a test setup for that yet.
 

Result from the cache flush test: while doing synchronous writes at full speed
we pulled the device from the system and compared the contents afterwards.
Result: no writes lost. We repeated the test several times.
Cross check: we also pulled the device while writing with the cache enabled,
and it lost 8 writes.

So I'd say, yes, it flushes its cache on request.

--
Arne


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Arne Jansen

Ross Walker wrote:


Raidz is definitely made for sequential IO patterns not random. To get good 
random IO with raidz you need a zpool with X raidz vdevs where X = desired 
IOPS/IOPS of single drive.



I have seen statements like this repeated several times, though
I haven't been able to find an in-depth discussion of why this
is the case. From what I've gathered every block (what is the
correct term for this? zio block?) written is spread across the
whole raid-z. But in what units? will a 4k write be split into
512 byte writes? And in the opposite direction, every block needs
to be read fully, even if only parts of it are being requested,
because the checksum needs to be checked? Will the parity be
read, too?
If this is all the case, I can see why raid-z reduces the performance
of an array effectively to one device w.r.t. random reads.

Thanks,
Arne


Re: [zfs-discuss] One dataset per user?

2010-06-22 Thread Arne Jansen

Paul B. Henson wrote:

On Sun, 20 Jun 2010, Arne Jansen wrote:


In my experience the boot time mainly depends on the number of datasets,
not the number of snapshots. 200 datasets is fairly easy (we have 7000,
but did some boot-time tuning).


What kind of boot tuning are you referring to? We've got about 8k
filesystems on an x4500, it takes about 2 hours for a full boot cycle which
is kind of annoying. The majority of that time is taken up with NFS
sharing, which currently scales very poorly :(.


As you said most of the time is spent for nfs sharing, but mounting also isn't
as fast as it could be. We found that the zfs utility is very inefficient as
it does a lot of unnecessary and costly checks. We set mountpoint to legacy
and handle mounting/sharing ourselves in a massively parallel fashion (50
processes). Using the system utilities makes things a lot better, but you
can speed up sharing a lot more by setting the SHARE_NOINUSE_CHECK environment
variable before invoking share(1M). With this you should be able to share your
tree in about 10 seconds.

Good luck,
Arne



Thanks...






Re: [zfs-discuss] One dataset per user?

2010-06-22 Thread Arne Jansen

Arne Jansen wrote:

Paul B. Henson wrote:

On Sun, 20 Jun 2010, Arne Jansen wrote:


In my experience the boot time mainly depends on the number of datasets,
not the number of snapshots. 200 datasets is fairly easy (we have 7000,
but did some boot-time tuning).


What kind of boot tuning are you referring to? We've got about 8k
filesystems on an x4500, it takes about 2 hours for a full boot cycle 
which

is kind of annoying. The majority of that time is taken up with NFS
sharing, which currently scales very poorly :(.


As you said most of the time is spent for nfs sharing, but mounting also 
isn't
as fast as it could be. We found that the zfs utility is very 
inefficient as

it does a lot of unnecessary and costly checks. We set mountpoint to legacy
and handle mounting/sharing ourselves in a massively parallel fashion (50
processes). Using the system utilities makes things a lot better, but you
can speed up sharing a lot more by setting the SHARE_NOINUSE_CHECK 
environment
variable before invoking share(1M). With this you should be able to 
share your

tree in about 10 seconds.


I forgot the disclaimer: with this flag set, you can crash your machine if you
call share with improper arguments. IIRC it skips a check of whether the fs
is already shared, so it cannot handle a re-share properly.



Good luck,
Arne



Thanks...






Re: [zfs-discuss] One dataset per user?

2010-06-21 Thread Arne Jansen
David Magda wrote:
 On Jun 21, 2010, at 05:00, Roy Sigurd Karlsbakk wrote:
 
 So far the plan is to keep it in one pool for design and
 administration simplicity. Why would you want to split up (net) 40TB
 into more pools? Seems to me that'll mess up things a bit, having to
 split up SSDs for use on different pools, loosing the flexibility of a
 common pool etc. Why?
 
 If different groups or areas have different I/O characteristics for one.
 If in one case (users) you want responsiveness, you could go with
 striped-mirrors. However, if departments have lots of data, it may be
 worthwhile to put it on a RAID-Z pool for better storage efficiency.
 

Especially if the characteristics are different I find it a good idea
to mix all on one set of spindles. This way you have lots of spindles
for fast access and lots of space for the sake of space. If you divide
the available spindles into two sets, you will have much fewer spindles
available for the responsiveness goal. I don't think taking them into
a mirror can compensate for that.

--Arne


Re: [zfs-discuss] SLOG striping?

2010-06-21 Thread Arne Jansen

Roy Sigurd Karlsbakk wrote:

Hi all

I plan to set up a new system with four Crucial RealSSD 256GB SSDs for both SLOG
and L2ARC. The plan is to use four small slices for the SLOG, striping two 
mirrors. I have seen questions in here about the theoretical benefit of doing 
this, but I haven't seen any answers, just some doubt about the effect.

Does anyone know if this will help gaining performance? Or will it be bad?


I'm planning to do something similar, though I only want to install 2 devices.
Some thoughts I had so far:

 - mirroring l2arc won't gain anything, as it doesn't contain any information
   that cannot be rebuilt if a device is lost. Further, if a device is lost,
   the system just uses the remaining devices. So I wouldn't waste any space
   mirroring l2arc, I'll just stripe them.
 - the purpose of a zil device is to reduce latency. Throughput is probably not
   an issue, especially if you configure your pool so that large writes go to
   the main pool. As 2 devices don't have a lower latency than one, I see no
   real point in striping slog devices.
 - For slog you need SSD with supercap which are significantly more expensive
   than without. I'll try the OCZ Vertex 2 Pro in the next few days and can
   give a report how it performs. For L2ARC cheap MLC SSDs will do.

So if I had the chance to buy 4 devices, I'd probably buy 2 different sets.
2 cheap large L2ARC devices, 2 fast supercapped small ones. The 2 slog devices
would go into a mirror, the L2ARC devices in a stripe. I'd probably take the
remaining space of the slog devices into the stripe, too, though this might
affect write performance.

Just me thoughts...

--
Arne



Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum is presented intelligibly. It is
an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases, adequate and relevant synonyms
exist in Norwegian.


Re: [zfs-discuss] SLOG striping?

2010-06-21 Thread Arne Jansen

Roy Sigurd Karlsbakk wrote:

- mirroring l2arc won't gain anything, as it doesn't contain any
information that cannot be rebuilt if a device is lost. Further, if a
device is lost,
the system just uses the remaining devices. So I wouldn't waste any
space mirroring l2arc, I'll just stripe them.


I don't plan to attempt to mirror L2ARC. Even the docs say it's unsupported, so 
no point of that.


Oops, makes sense ;)




- For slog you need SSD with supercap which are significantly more
expensive than without. I'll try the OCZ Vertex 2 Pro in the next few
days and can
give a report how it performs. For L2ARC cheap MLC SSDs will do.


hm... Last I checked those OCZ Vertexes were on both the large and expensive
side. What do you pay for a couple of small ones? We'll be installing 48 gigs 
of memory in this box, but I doubt we'll need more than 4GB SLOG in terms of 
traffic.


50GB for 400 Euro. They are MLC flash, but, as someone in a different
thread pointed out, they have 3 years warranty ;) My hope is that they
last long enough until cheaper options become available. My major concern
is that if I buy two identical models they'll break the same day. This
is not purely hypothetical. If they internally just count the write cycles
and trigger a SMART fail if a certain threshold is reached, exactly this
will happen.


--
Arne


Re: [zfs-discuss] does sharing an SSD as slog and l2arc reduces its life span?

2010-06-21 Thread Arne Jansen

Wes Felter wrote:

On 6/19/10 3:56 AM, Arne Jansen wrote:

while
thinking about using the OCZ Vertex 2 Pro SSD (which according
to spec page has supercaps built in) as a shared slog and L2ARC
device


IMO it might be better to use the smallest (50GB, maybe overprovisioned 
down to ~20GB) Vertex 2 Pro as slog and a much cheaper SSD (X25-M) as 
L2ARC.




No budget for this. I'll be lucky if I can get the budget for the Vertex 2 Pro.
But if this sharing works (thanks to static wear leveling) it should be
sufficient to leave 10-20% of the space free.


As Bob Friesenhahn said, you're assuming dynamic wear leveling but 
modern SSDs also use static wear leveling, so this problem doesn't 
exist. (Note that in this context the terms dynamic and static may 
not mean what you think they mean.)


Thanks for the term. Yes, this makes sense.

--
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] One dataset per user?

2010-06-20 Thread Arne Jansen

Roy Sigurd Karlsbakk wrote:


I have read people are having problems with lengthy boot times with lots of 
datasets. We're planning to do extensive snapshotting on this system, so there 
might be close to a hundred snapshots per dataset, perhaps more. With 200 users 
and perhaps 10-20 shared department datasets, the number of filesystems, 
snapshots included, will be around 20k or more.


In my experience the boot time mainly depends on the number of datasets, not the
number of snapshots. 200 datasets is fairly easy (we have 7000, but did
some boot-time tuning).



Will trying such a setup be betting on help from some god, or is it doable? The 
box we're planning to use will have 48 gigs of memory and about 1TB L2ARC 
(shared with SLOG, we just use some slices for that).


Try it. The main problem with having many snapshots is the time used for zfs
list, because it has to scrape all the information from disk, but with that
much RAM/L2ARC it shouldn't be a problem here.
Another thing to consider is the frequency with which you plan to take the
snapshots and whether you want individual schedules for each dataset. Taking a
snapshot is a heavy-weight operation as it terminates the current txg.
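
Just to illustrate what I mean by individual schedules, a minimal sketch
(the list file and the snapshot naming are placeholders, not a recommendation):

for fs in `cat /etc/hourly-snapshots`; do
    zfs snapshot $fs@hourly-`date +%Y%m%d%H`
done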

Btw, what did you plan to use as L2ARC/slog?

--Arne





Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is
an elementary imperative for all pedagogues to avoid excessive use of idioms of
foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] does sharing an SSD as slog and l2arc reduces its life span?

2010-06-19 Thread Arne Jansen

Hi,

I don't know if it's already been discussed here, but while
thinking about using the OCZ Vertex 2 Pro SSD (which according
to its spec page has supercaps built in) as a shared slog and L2ARC
device it struck me that this might not be such a good idea.
Because this SSD is MLC based, write cycles are an issue here,
though I can't find any number in their spec.
Why do I think it might be a bad idea: L2ARC is quite static
in comparison with ZIL, and L2ARC takes all the space it can get.
But if 90% of the device is nearly statically allocated, the
device's possibilities for wear leveling are very restricted.
If the ZIL is heavily used, the same 10% of the device get
written over and over again, reducing the life span by 90%.

Is there some fundamental flaw in this line of thought?

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Erratic behavior on 24T zpool

2010-06-18 Thread Arne Jansen
Curtis E. Combs Jr. wrote:
 Sure. And hey, maybe I just need some context to know what's normal
 IO for the zpool. It just...feels...slow, sometimes. It's hard to
 explain. I attached a log of iostat -xn 1 while doing mkfile 10g
 testfile on the zpool, as well as your dd with the bs set really high.
 When I Ctrl-C'ed the dd it said 460M/sec... like I said, maybe I just
 need some context...
 

These iostats don't match the creation of any large files. What are
you doing there? It looks more like 512-byte random writes... Are you
generating the load locally or remotely?

 
 On Fri, Jun 18, 2010 at 5:36 AM, Arne Jansen sensi...@gmx.net wrote:
 artiepen wrote:
 40MB/sec is the best that it gets. Really, the average is 5. I see 4, 5, 2, 
 and 6 almost 10x as many times as I see 40MB/sec. It really only bumps up 
 to 40 very rarely.

 As far as random vs. sequential. Correct me if I'm wrong, but if I used dd 
 to make files from /dev/zero, wouldn't that be sequential? I measure with 
 zpool iostat 2 in another ssh session while making files of various sizes.

 This is a test system. I'm wondering, now, if I should just reconfigure 
 with maybe 7 disks and add another spare. Seems to be the general consensus 
 that bigger raid pools = worse performance. I thought the opposite was 
 true...
 A quick test on a system with 21 1TB SATA-drives in a single
 RAIDZ2 group shows a performance of about 400MB/s with a
 single dd, blocksize=1048576. Creating a 10G-file with mkfile
 takes 25 seconds as well.
 So I'd say basically there is nothing wrong with the zpool
 configuration. Can you paste some iostat -xn 1 output while
 your test is running?

 --Arne

 
 
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Erratic behavior on 24T zpool

2010-06-18 Thread Arne Jansen
Curtis E. Combs Jr. wrote:
 Um...I started 2 commands in 2 separate ssh sessions:
 in ssh session one:
 iostat -xn 1  stats
 in ssh session two:
 mkfile 10g testfile
 
 when the mkfile was finished I did the dd command...
 on the same zpool1 and zfs filesystem... that's it, really

No, this doesn't match. Did you enable compression or dedup?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VXFS to ZFS Quota

2010-06-18 Thread Arne Jansen
David Magda wrote:
 On Fri, June 18, 2010 08:29, Sendil wrote:
 
 I can create 400+ file system for each users,
 but will this affect my system performance during the system boot up?
 Is this recommanded or any alternate is available for this issue.
 
 You can create a dataset for each user, and then set a per-dataset quota
 for each one:
 
 quota=size | none


as a side note, you do not need to worry about creating 400 filesystems.
A few thousand are no problem.
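
A quick sketch of the per-user pattern (pool and user names made up):

zfs create tank/home/user1
zfs set quota=10G tank/home/user1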

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Erratic behavior on 24T zpool

2010-06-18 Thread Arne Jansen

Sandon Van Ness wrote:

Sounds to me like something is wrong as on my 20 disk backup machine
with 20 1TB disks on a single raidz2 vdev I get the following with DD on
sequential reads/writes:

writes:

r...@opensolaris: 11:36 AM :/data# dd bs=1M count=100000 if=/dev/zero
of=./100gb.bin
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 233.257 s, 450 MB/s

reads:

r...@opensolaris: 11:44 AM :/data# dd bs=1M if=./100gb.bin of=/dev/null
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 131.051 s, 800 MB/s

zpool iostat pool 10

gives me about the same values that dd gives me. Maybe you have a bad
drive somewhere? Which Areca controller are you using? Maybe you can
pull the SMART info off the drives from a Linux boot CD, as some of the
controllers support that. Could be a bad drive somewhere.



Didn't he say he already gets 400MB/s from dd, but zpool iostat only
shows a few MB/s? What does zpool iostat show, the value before or after
dedup?
Curtis, to see if your physical setup is ok you should turn off dedup and
measure again. Otherwise you only measure the power of your machine to
dedup /dev/zero.
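
Something along these lines, guessing from your earlier mail that the
dataset is zpool1/zfs:

zfs set dedup=off zpool1/zfs
dd bs=1M count=10000 if=/dev/zero of=/zpool1/zfs/testfile2

and then watch zpool iostat again while it runs.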

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Arne Jansen
Christopher George wrote:
 So why buy SSD for ZIL at all?
 
 For the record, not all SSDs ignore cache flushes.  There are at least 
 two SSDs sold today that guarantee synchronous write semantics; the 
 Sun/Oracle LogZilla and the DDRdrive X1.  Also, I believe it is more 

LogZilla? Are these those STEC thingies? For the price of those I can
buy a battery-backed RAID controller and a few conventional drives.
For ZIL this will probably do better at a lower price than STEC.

The DDRdrive I wouldn't call a flash drive but rather an NVRAM card.
NVRAM cards are the proper way to go for ZIL. Someone should build
one for < $600; PCIe x1 would be sufficient. Xilinx has some nice
Spartans :)

 accurate to describe the root cause as not power protecting on-board 
 volatile caches.  As the X25-E does implement the ATA FLUSH 
 CACHE command, but does not have the required power protection to 
 avoid transaction (data) loss.

You could say the same about hard drives. They also just need proper
protection for their volatile caches...

--Arne

 
 Best regards,
 
 Christopher George
 Founder/CTO
 www.ddrdrive.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Arne Jansen
Arve Paalsrud wrote:
 Not to forget the Deneva Reliability disks from OCZ that just got
 released. See
 http://www.oczenterprise.com/details/ocz-deneva-reliability-2-5-emlc-ssd.html
 
 The Deneva Reliability family features built-in supercapacitor (SF-1500
 models) that acts as a temporary power backup in the event of sudden power
 loss, and enables the drive to complete its task ensuring no data loss.
 

This one looks really interesting. No price to be found though, and no details
about how many write cycles they can withstand.

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] At what level does the “zfs” directory exist?

2010-06-16 Thread Arne Jansen
David Markey wrote:
 
 I have done a similar deployment,
 
 However we gave each student their own ZFS filesystem. Each of which had
 a .zfs directory in it.

Don't host 50k filesystems on a single pool. It's more pain than it's
worth.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] At what level does the “zfs” directory exist?

2010-06-16 Thread Arne Jansen
MichaelHoy wrote:
 I’ve posted a query regarding the visibility of snapshots via CIFS here 
 (http://opensolaris.org/jive/thread.jspa?threadID=130577&tstart=0)
 however, I’m beginning to suspect that it may be a more fundamental ZFS 
 question so I’m asking the same question here.
 
 At what level does the “zfs” directory exist?
 
 If the “.zfs” subdirectory only exists as the direct child of the mount 
 point, then can someone suggest how I can make it visible lower down without 
 requiring me (even if it were possible for 50k users) to make each user’s 
 home folder a file system?
 
 By way of a background, I’m looking at the possibility of hosting our 
 students’ personal file space on OpenSolaris since the capacities required go 
 well beyond my budget to keep investing in our NetApp kit.
 So far I’ve managed to implement the same functionality; however, the 
 visibility of the snapshots to allow self-service file restores is a real 
 issue which may prevent me from going forward on this platform.
 
 I’d appreciate any suggestions.

Do you only want to share the filesystem via CIFS? Have you had a look
at the shadow_copy2 extension for samba? It maps the snapshots so Windows
can access them via 'Previous Versions' in Explorer's context menu.
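
Roughly along these lines in smb.conf; the format string is just an example
and has to match whatever snapshot naming scheme you use:

[homes]
    vfs objects = shadow_copy2
    shadow:snapdir = .zfs/snapshot
    shadow:format = snap-%Y%m%d-%H%M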

--Arne

 
 Thanks
 Michael.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Arne Jansen
David Magda wrote:
 On Wed, June 16, 2010 11:02, David Magda wrote:
 [...]
 Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
 numbers see:
 
 s/suck/such/

Ah, I tried to make sense of 'suck', as in 'just writing
sequentially' or something like that ;)

 
 :)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Arne Jansen
David Magda wrote:
 On Wed, June 16, 2010 10:44, Arne Jansen wrote:
 David Magda wrote:

 I'm not sure you'd get the same latency and IOps with disk that you can
 with a good SSD:

 http://blogs.sun.com/brendan/entry/slog_screenshots
 [...]
 Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or main
 pool. Because ZIL issues nearly sequential writes, due to the
 NVRAM-protection
 of the RAID-controller the disk can leave the write cache enabled. This
 means
 the disk can write essentially with full speed, meaning 150MB/s for a 15k
 drive.
 114000 4k writes/s are 456MB/s, so 3 spindles should do.
 
 Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
 numbers see:
 
 http://blogs.sun.com/brendan/entry/l2arc_screenshots
 

Oops, sorry, I should at least have scrolled down a bit on your link...
Nevertheless I don't find it improbable to reach numbers like that with a
proper RAID setup. Of course it will take more space and power. Maybe someone
has done some testing on this.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Arne Jansen

Bob Friesenhahn wrote:

On Wed, 16 Jun 2010, Arne Jansen wrote:


Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or 
main
pool. Because ZIL issues nearly sequential writes, due to the 
NVRAM-protection
of the RAID-controller the disk can leave the write cache enabled. 
This means
the disk can write essentially with full speed, meaning 150MB/s for a 
15k drive.

114000 4k writes/s are 456MB/s, so 3 spindles should do.


Huh?  What does the battery backed memory of a RAID-controller have to 
do with the unprotected memory of a hard drive?  This does not compute.  


You're right, I took a wrong turn there. Of course the RAID-controller
disables the write cache of the disks. But because the controller ACKs
each write immediately (as long as it has buffer left), the requests can
be queued at the disk. This enables the disk to write continuously.
I double-checked before posting: I can nearly saturate a 15k disk if I
make full use of the 32 queue slots, giving 137 MB/s or 34k IOPS. Times
3 nearly matches the above-mentioned 114k IOPS :)

Thanks,
Arne

The flushes that the RAID-controller acks need to be ultimately 
delivered to the disk or else there WILL be data loss. The RAID 
controller should not purge its own record until the disk reports that 
it has flushed its cache.  Once the RAID controller's cache is full, 
then it should start stalling writes.





Bob


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Arne Jansen

David Magda wrote:

On Wed, June 16, 2010 15:15, Arne Jansen wrote:


I double checked before posting: I can nearly saturate a 15k disk if I
make full use of the 32 queue slots giving 137 MB/s or 34k IOPS/s. Times
3 nearly matches the above mentioned 114k IOPS :)


34K*3 = 102K. 12K isn't anything to sneeze at :)

So you'll need six disks to do what one SSD does: three spindles, and two
(mirrored) disks on each spindle for redundancy (drives are riskier than
SSDs).



OK, 4 spindles, we already have a RAID controller available :) But personally
I trust drives more than SSDs.
Are the 114k with mirrored or striped Logzillas? In any case there are two of
them, so I'd double that RAID-controller setup as well, which would still be
cheaper than the STEC devices.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] SSDs adequate ZIL devices?

2010-06-15 Thread Arne Jansen

There have been many threads in the past asking about ZIL
devices. Most of them end up recommending the Intel X25
as an adequate device. Nevertheless there is always the
warning about them not heeding cache flushes. But what use
is a ZIL that ignores cache flushes? If I'm willing to
tolerate that (I'm not), I can just as well take a mechanical
drive and force zfs to not issue cache flushes to it. In
that case it can easily compete with an SSD in regard to IOPS
and bandwidth. In case of a power failure I will likely
lose about as many writes as I do with SSDs, a few milliseconds' worth.
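
By 'force zfs to not issue cache flushes' I mean the well-known /etc/system
tunable (it is machine-wide, so handle with care):

set zfs:zfs_nocacheflush = 1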

So why buy SSD for ZIL at all?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-15 Thread Arne Jansen

Bob Friesenhahn wrote:

On Tue, 15 Jun 2010, Arne Jansen wrote:

In case of a power failure I will likely lose about as many writes as 
I do with SSDs, a few milliseconds.


I agree with your concerns, but the data loss may span as much as 30 
seconds rather than just a few milliseconds.


Wait, I'm talking about using an SSD for ZIL vs. using a dedicated hard drive
for ZIL which is configured to ignore cache flushes. Are you saying I can lose
30 seconds even if I use a badly behaving SSD?



Using an SSD as the ZIL allows zfs to turn a synchronous write into a 
normal batched async write which is scheduled for the next TXG.  Zfs 
intentionally postpones writes.


Without the SSD, zfs needs to write to an intent log in the main pool 
(consuming precious IOPS) or write directly to the main pool (consuming 
precious response latency).  Battery-backed RAM in the adaptor card or 
storage array can do almost as well as the SSD as long as the amount of 
data does not overrun the limited write cache.


Bob


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots, txgs and performance

2010-06-14 Thread Arne Jansen
Marcelo Leal wrote:
 Hello there,
  I think you should share it with the list, if you can; it seems like 
 interesting work. ZFS has some issues with snapshots and spa_sync performance 
 for snapshot deletion.

I'm a bit reluctant to post it to the list where it can still be found
years from now. Because the module is not compiled directly into ZFS
but is a separate module that makes heavy use of internal structures
of ZFS, it is designed for a specific version of ZFS (Solaris U8). It
might still load without problems for years, but already in the next
Solaris version it might wreak havoc because of a changed kernel structure.
A much better way would be to have a similar operation integrated into
the official source tree. I could try to build a patch if it has a
chance of getting accepted.

Until then, I have no problem with sharing it off-list.

--Arne

  
  Thanks
 
  Leal
 [ http://www.eall.com.br/blog ]

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] size of slog device

2010-06-14 Thread Arne Jansen
Hi,

I know it's been discussed here more than once, and I read the
Evil Tuning Guide, but I didn't find a definitive statement:

There is absolutely no sense in having slog devices larger than
main memory, because the excess will never be used, right?
ZFS will rather flush the txg to disk than read back from the
zil?
So there is a guideline to have enough slog to hold about 10
seconds of zil, but the absolute maximum value is the size of
main memory. Is this correct?

Thanks,
Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Arne Jansen
Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Arne Jansen

 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?
 
 Also:  A TXG is guaranteed to flush within 30 sec.  Let's suppose you have a
 super fast device, which is able to log 8Gbit/sec (which is unrealistic).
 That's 1Gbyte/sec, unrealistically theoretically possible, at best.  You do
 the math.  ;-)
 
 That being said, it's difficult to buy an SSD smaller than 32G.  So what are
 you going to do?

I'm still building my rotational-write-delay-eliminating driver and am trying
to figure out how much space I can waste on the underlying device without ever
running into problems. I need half the physical memory, or, under the assumption
that it might be tunable, at most my physical memory. It's good to know
a hard upper limit. The more I can waste, the faster the device will be.

Also, to follow your line of argument, this super-fast slog is most
probably a DRAM-based, battery-backed solution. In that case it makes
a difference if you buy 8 or 32GB ;)
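
To put rough numbers on that upper bound, using the 30 sec figure quoted
above: a device absorbing 1 GByte/s of sync traffic would need at most
1 GByte/s * 30 s = 30 GByte of slog, and a more realistic 200 MByte/s caps
out at around 6 GByte.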

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] size of slog device

2010-06-14 Thread Arne Jansen
Roy Sigurd Karlsbakk wrote:
 There is absolutely no sense in having slog devices larger than
 then main memory, because it will never be used, right?
 ZFS will rather flush the txg to disk than reading back from
 zil? So there is a guideline to have enough slog to hold about 10
 seconds of zil, but the absolute maximum value is the size of
 main memory. Is this correct?
 
 ZFS uses at most RAM/2 for ZIL

Thanks!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] panic after zfs mount

2010-06-13 Thread Arne Jansen

Thomas Nau wrote:

Dear all

We ran into a nasty problem the other day. One of our mirrored zpools
hosts several ZFS filesystems. After a reboot (all FS mounted and in use
at that time) the machine panicked (console output further down). After
detaching one of the mirrors the pool fortunately imported automatically
in a faulted state without mounting the filesystems. Offlining the
unplugged device and clearing the fault allowed us to disable
auto-mounting of the filesystems. Going through them one by one, all but one
mounted OK. That one again triggered a panic. We left mounting on that
one disabled for now to be back in production after pulling data from
the backup tapes. Scrubbing didn't show any errors, so any idea what's
behind the problem? Any chance to fix the FS?


We had the same problem. Victor pointed me to

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6742788

with a workaround to mount the filesystem read-only to save the data.
I still hope to figure out the chain of events that causes this. Did you
use any extended attributes on this filesystem?
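
(The read-only workaround boils down to something like

zfs mount -o ro tank/data

with the dataset name adjusted, of course.)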

--
Arne

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Are recursive snapshot destroy and rename atomic too?

2010-06-11 Thread Arne Jansen
Darren J Moffat wrote:
 But the following document says Recursive ZFS snapshots are created
 quickly as one atomic operation. The snapshots are created together (all
 at once) or not created at all.
 http://docs.sun.com/app/docs/doc/819-5461/gdfdt?a=view
 
 I've looked at the code again - I misread part of it - it does appear
 to be all-or-nothing both for the create and destroy.
 

I read the code differently, zfs destroy does the iteration in the
zfs utility, not even in libzfs. The ioctl doesn't even have a recurse
flag.

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Are recursive snapshot destroy and rename atomic too?

2010-06-11 Thread Arne Jansen
Darren J Moffat wrote:
 On 11/06/2010 11:42, Arne Jansen wrote:
 Darren J Moffat wrote:
 But the following document says Recursive ZFS snapshots are created
 quickly as one atomic operation. The snapshots are created together
 (all
 at once) or not created at all.
 http://docs.sun.com/app/docs/doc/819-5461/gdfdt?a=view

 I've looked at the code again - I miss read part of it - it does appear
 to be an all or nothing both for the create and destroy.


 I read the code differently, zfs destroy does the iteration in the
 zfs utility, not even in libzfs. The ioctl doesn't even have a recurse
 flag.
 
 In zfs_do_destroy() if recurse is set we call zfs_destroy_snaps()
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libzfs/common/libzfs_dataset.c#zfs_destroy_snaps
 
 
 Which makes a single ioctl call ZFS_IOC_DESTROY_SNAPS
 

Ah! There's an extra ioctl for a recursive snapshot delete, I missed
that part. Many thanks! This will also help me with my multi-snapshot-
destroy-module. I should at least have checked with a simple truss...
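
(For the record, a quick check would have been something like this, with a
made-up pool/snapshot name:

truss -t ioctl zfs destroy -r tank@test

to see whether the destroy comes down to a single ioctl or to one per
snapshot.)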

You made my day :)

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Arne Jansen

Andrey Kuzmin wrote:
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:


Andrey Kuzmin wrote:

As to your results, it sounds almost too good to be true. As Bob
has pointed out, h/w design targeted hundreds IOPS, and it was
hard to believe it can scale 100x. Fantastic.


Hundreds IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache


Linear? May be sequential?


Aren't these synonyms? linear as opposed to random.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Arne Jansen

Andrey Kuzmin wrote:

Well, I'm more accustomed to  sequential vs. random, but YMMW.

As to 67000 512 byte writes (this sounds suspiciously close to 32Mb 
fitting into cache), did you have write-back enabled?




It's a sustained number, so it shouldn't matter.


Regards,
Andrey



On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen sensi...@gmx.net wrote:


Andrey Kuzmin wrote:

On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote:

   Andrey Kuzmin wrote:

   As to your results, it sounds almost too good to be true.
As Bob
   has pointed out, h/w design targeted hundreds IOPS, and
it was
   hard to believe it can scale 100x. Fantastic.


   Hundreds IOPS is not quite true, even with hard drives. I
just tested
   a Hitachi 15k drive and it handles 67000 512 byte linear
write/s, cache


Linear? May be sequential?


Aren't these synonyms? linear as opposed to random.





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Arne Jansen

Andrey Kuzmin wrote:
As to your results, it sounds almost too good to be true. As Bob has 
pointed out, h/w design targeted hundreds IOPS, and it was hard to 
believe it can scale 100x. Fantastic.


Hundreds of IOPS is not quite true, even with hard drives. I just tested
a Hitachi 15k drive and it handles 67000 512-byte linear writes/s, cache
enabled.

--Arne



Regards,
Andrey



On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote:


On 21/10/2009 03:54, Bob Friesenhahn wrote:


I would be interested to know how many IOPS an OS like Solaris
is able to push through a single device interface.  The normal
driver stack is likely limited as to how many IOPS it can
sustain for a given LUN since the driver stack is optimized for
high latency devices like disk drives.  If you are creating a
driver stack, the design decisions you make when requests will
be satisfied in about 12ms would be much different than if
requests are satisfied in 50us.  Limitations of existing
software stacks are likely reasons why Sun is designing hardware
with more device interfaces and more independent devices.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots, txgs and performance

2010-06-07 Thread Arne Jansen

thomas wrote:

Very interesting. This could be useful for a number of us. Would you be willing 
to share your work?


No problem. I'll contact you off-list.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots, txgs and performance

2010-06-05 Thread Arne Jansen

Arne Jansen wrote:

Hi,

I have a setup with thousands of filesystems, each containing several
snapshots. For a good percentage of these filesystems I want to create
a snapshots once every hour, for others once every 2 hours and so forth.
I built some tools to do this, no problem so far.

While examining disk load on the system, I found out that load jumps
up whenever the snapshot creation process is running. Delving a bit
deeper it seems that every snapshot start a new txg. This seems to be
quite costly, about 100-150 IOPs. This takes about one second, so I can
create one snapshot per second. Creating all necessary snapshot for one
hour takes about 45 minutes. During this time the disks are at 70%
utilization and txgs are back-to-back. So I need to optimize this.

Looking at the code it seems that recursive snapshots are being
collected into a single txg. So my aim is to collect all necessary
snapshots in a single txg, too. Using libzfs or ioctl I haven't found
any way to do this. I cannot just use recursive snapshots, because
not all filesystems need to be snapshotted. Same with snapshot deletion.

My idea is to write a small kernel module that roughly duplicates the
code of zfs_ioc_snapshot, but instead of being fully recursive gets
passed a list of filesystems to snapshot.

My questions:
 - is there any easier way to bring down disk load and accelerate
   snapshot creation?
 - are there any arguments, why my approach isn't feasible?



Just as a followup, the module works smoothly for creations. If I take
100 snapshots at a time, the gain is roughly a factor of 100 :)
Because there are no recursive deletions in kernel space (recursive
deletions are handled by libzfs), I haven't found an easy way to do
likewise for delete. Maybe someday when I get a better grip on the
source...

--Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Arne Jansen
Brent Jones wrote:
 
 I don't think you'll find the performance you paid for with ZFS and
 Solaris at this time. I've been trying for more than a year, and
 watching dozens, if not hundreds, of threads.
 Getting halfway decent performance from NFS and ZFS is impossible
 unless you disable the ZIL.

A few days ago I posted to nfs-discuss with a proposal to add some
mount/share options to change the semantics of an NFS-mounted filesystem
so that they parallel those of a local filesystem.
The main point is that data gets flushed to stable storage only if the
client explicitly requests it via fsync() or O_DSYNC, not implicitly
with every close().
That would give you the performance you are seeking without sacrificing
data integrity for applications that need it.

I get the impression that I'm not the only one who could be interested
in that ;)

-Arne

 
 You'd be better off getting NetApp
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Snapshots, txgs and performance

2010-03-27 Thread Arne Jansen
Hi,

I have a setup with thousands of filesystems, each containing several
snapshots. For a good percentage of these filesystems I want to create
a snapshot once every hour, for others once every 2 hours and so forth.
I built some tools to do this, no problem so far.

While examining disk load on the system, I found out that the load jumps
up whenever the snapshot creation process is running. Delving a bit
deeper, it seems that every snapshot starts a new txg. This seems to be
quite costly, about 100-150 IOPS. This takes about one second, so I can
create one snapshot per second. Creating all necessary snapshots for one
hour takes about 45 minutes. During this time the disks are at 70%
utilization and txgs are back-to-back. So I need to optimize this.

Looking at the code it seems that recursive snapshots are being
collected into a single txg. So my aim is to collect all necessary
snapshots in a single txg, too. Using libzfs or ioctl I haven't found
any way to do this. I cannot just use recursive snapshots, because
not all filesystems need to be snapshotted. Same with snapshot deletion.
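
To illustrate with made-up names: what I'd like is the single-txg behaviour of

zfs snapshot -r tank@2010-03-27-12

but for an arbitrary subset of the filesystems, instead of issuing

zfs snapshot tank/fs1@2010-03-27-12
zfs snapshot tank/fs2@2010-03-27-12
...

one by one, each paying for its own txg.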

My idea is to write a small kernel module that roughly duplicates the
code of zfs_ioc_snapshot, but instead of being fully recursive gets
passed a list of filesystems to snapshot.

My questions:
 - is there any easier way to bring down disk load and accelerate
   snapshot creation?
 - are there any arguments, why my approach isn't feasible?

Thanks for any hints.

-Arne
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Resilver speed

2009-10-23 Thread Arne Jansen
Hi,

I have a pool of 22 1T SATA disks in a RAIDZ3 configuration. It is filled with 
files of an average size of 2MB. I filled it randomly to resemble the expected 
workload in production use.
Problems arise when I try to scrub/resilver this pool. This operation takes the 
better part of a week (!). During this time the disk being resilvered is at 
100% utilisation with 300 writes/s, but only 3MB/s, which is only about 3% of 
its best case performance.
Having a window of one week with degraded redundancy is intolerable. It is 
quite likely that one loses more disks during this period, eventually leading 
to a total loss of the pool, not to mention the degraded performance in the 
meantime. In fact, in previous tests I lost a pool in a 6x11 RAIDZ2 
configuration.

I skimmed through the resilver code and found out that it just enumerates 
all objects in the pool and checks them one by one, with maxinflight 
I/O requests in parallel. Because this does not take the on-disk order of the 
data into account, it leads to this pathological performance. Also I found Bug 
6678033 stating that a prefetch might fix this.

Now my questions:
1) Are there tunings that could speed up resilver, possibly with a negative 
effect on normal performance? I thought of raising the recordsize to the 
expected file size of 2MB (but see the note below). Could this help?
2) What is the state of the fix? When will it be ready?
3) Do you have any configuration hints for setting up a pool layout which might 
help resilver performance? (aside from using hardware RAID instead of RAIDZ)
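
A note on 1): as far as I know the current builds cap recordsize at 128K, so
2MB isn't reachable anyway. The most one could do is something like

zfs set recordsize=128k tank

(pool name made up), and that only affects files written after the change.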

Thanks for any hints.
sensille
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss