Re: [zfs-discuss] Making ZIL faster

2012-10-05 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Neil Perrin [mailto:neil.per...@oracle.com]
 
 In general - yes, but it really depends. Multiple synchronous writes of any
 size
 across multiple file systems will fan out across the log devices. That is
 because there is a separate independent log chain for each file system.
 
 Also large synchronous writes (eg 1MB) within a specific file system will be
 spread out.
 The ZIL code will try to allocate a block to hold all the records it needs to
 commit up to the largest block size - which currently for you should be 128KB.
 Anything larger will allocate a new block - on a different device if there are
 multiple devices.
 
 However, lots of small synchronous writes to the same file system might not
 use more than one 128K block, and so would not benefit from multiple slog
 devices.

That is an awesome explanation.  Thank you.



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Schweiss, Chip
Thanks for all the input.  It seems information on the performance of the
ZIL is sparse and scattered.   I've spent significant time researching this
the past day.  I'll summarize what I've found.   Please correct me if I'm
wrong.

   - The ZIL can have any number of SSDs attached, either mirrored or
   individually.   ZFS will stripe across these in a raid0 or raid10 fashion
   depending on how you configure.
   - To determine the true maximum streaming performance of the ZIL, setting
   sync=disabled will only use the in-RAM ZIL.   This gives up power
   protection for synchronous writes.
   - Many SSDs do not help protect against power failure because they have
   their own RAM cache for writes.  This effectively makes the SSD useless for
   this purpose and potentially introduces a false sense of security.  (These
   SSDs are fine for L2ARC.)
   - Mirroring SSDs is only helpful if one SSD fails at the time of a power
   failure.  This leaves several unanswered questions.  How good is ZFS at
   detecting that an SSD is no longer a reliable write target?   The chance of
   silent data corruption is well documented for spinning disks.  What
   chance of data corruption does this introduce with up to 10 seconds of data
   written on SSD?  Does ZFS read the ZIL during a scrub to determine if our
   SSD is returning what we write to it?
   - Zpool versions 19 and higher should be able to survive a ZIL failure,
   only losing the uncommitted data.   However, I haven't seen good enough
   information that I would necessarily trust this yet.
   - Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
   SSDs.   I'm not sure if that is current, but I can't find any reports of
   better performance.   I would suspect that DDR drive or Zeus RAM as ZIL
   would push past this.

Anyone care to post their performance numbers on current hardware with E5
processors, and RAM-based ZIL solutions?

Thanks to everyone who has responded and contacted me directly on this
issue.

-Chip
On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel 
andrew.gabr...@cucumber.demon.co.uk wrote:

 Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip

 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the
 ZIL to
 make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.


 Temporarily set sync=disabled
 Or, depending on your application, leave it that way permanently.  I
 know, for the work I do, most systems I support at most locations have
 sync=disabled.  It all depends on the workload.


 Noting of course that this means that in the case of an unexpected system
 outage or loss of connectivity to the disks, synchronous writes since the
 last txg commit will be lost, even though the applications will believe
 they are secured to disk. (ZFS filesystem won't be corrupted, but it will
 look like it's been wound back by up to 30 seconds when you reboot.)

 This is fine for some workloads, such as those where you would start again
 with fresh data and those which can look closely at the data to see how far
 they got before being rudely interrupted, but not for those which rely on
 the Posix semantics of synchronous writes/syncs meaning data is secured on
 non-volatile storage when the function returns.

 --
 Andrew



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Andrew Gabriel [mailto:andrew.gabr...@cucumber.demon.co.uk]
 
  Temporarily set sync=disabled
  Or, depending on your application, leave it that way permanently.  I know,
 for the work I do, most systems I support at most locations have
 sync=disabled.  It all depends on the workload.
 
 Noting of course that this means that in the case of an unexpected system
 outage or loss of connectivity to the disks, synchronous writes since the last
 txg commit will be lost, even though the applications will believe they are
 secured to disk. (ZFS filesystem won't be corrupted, but it will look like 
 it's
 been wound back by up to 30 seconds when you reboot.)
 
 This is fine for some workloads, such as those where you would start again
 with fresh data 

It's fine for any load where you don't have clients keeping track of your state.

Examples where it's not fine:  You're processing credit card transactions.  You 
just finished processing a transaction, system crashes, and you forget about 
it.  Not fine, because systems external to yourself are aware of state that is 
in the future of your state, and you aren't aware of it.

You're an NFS server.  Some clients write some files, you say they're written, 
you crash and forget about it.  Now you reboot, start serving NFS again, and 
the client still has a file handle for something it thinks exists ... but 
according to you, in your new state, it doesn't exist.

You're a database server, and your clients are external to yourself.  They do 
transactions against you, you say they're complete, and you forget about it.

But it's ok:  

You're an NFS server, and you're configured to NOT restart NFS automatically 
upon reboot.  In the event of an ungraceful crash, admin intervention is 
required, and the admin knows he needs to reboot the NFS clients before 
starting the NFS services again.

You're a database server, and your clients are all inside yourself, either as 
VMs or services of various kinds.

You're a VM server, and all of your VMs are desktop clients, like a Windows 7 
machine for example.  None of your guests are servers in and of themselves 
maintaining state with external entities (such as processing credit card 
transactions, serving a database, or serving files).  The mere fact that you 
crash ungracefully implies your guests also crash ungracefully.  You all 
reboot, rewind a few seconds, no problem.



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 . The ZIL can have any number of SSDs attached either mirror or
 individually.   ZFS will stripe across these in a raid0 or raid10 fashion
 depending on how you configure.

I'm regurgitating something somebody else said - but I don't know where.  I 
believe multiple ZIL devices don't get striped.  They get round-robin'd.  This 
means your ZIL can absolutely become a bottleneck, if you're doing sustained 
high-throughput (not high-IOPS) sync writes.  But the way to prevent that 
bottleneck is by tuning the ... I don't know the names of the parameters.  Some 
parameters that indicate a sync write larger than X should skip the ZIL and go 
directly to the pool.
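
For what it's worth, the tunables usually mentioned in this context on 
illumos-era systems are zfs_immediate_write_sz and zil_slog_limit.  A rough 
sketch of setting them persistently, assuming those are indeed the parameters 
meant here and treating the values as examples rather than recommendations:

   # illumos/Solaris: persistent tunables go in /etc/system (takes effect at
   # the next boot); run as root
   # roughly: sync writes above this size are logged indirectly (the data
   # block goes to the pool, the ZIL record just points at it)
   echo 'set zfs:zfs_immediate_write_sz = 0x8000' >> /etc/system
   # roughly: commits above this size allocate their log blocks from the
   # main pool instead of the slog
   echo 'set zfs:zil_slog_limit = 0x100000' >> /etc/system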


 . To determine the true maximum streaming performance of the ZIL setting
 sync=disabled will only use the in RAM ZIL.   This gives up power protection 
 to
 synchronous writes.

There is no RAM ZIL.  The basic idea behind ZIL is like this:  Some 
applications simply tell the system to write and the system will buffer these 
writes in memory, and the application will continue processing.  But some 
applications do not want the OS to buffer writes, so they issue writes in 
sync mode.  These applications will issue the write command, and they will 
block there, until the OS says it's written to nonvolatile storage.  In ZFS, 
this means the transaction gets written to the ZIL, and then it gets put into 
the memory buffer just like any other write.  Upon reboot, when the filesystem 
is mounting, ZFS will always look in the ZIL to see if there are any 
transactions that have not yet been played to disk.

So, when you set sync=disabled, you're just bypassing that step.  You're lying 
to the applications: if they say "I want to know when this is written to disk," 
you just immediately say "Yup, it's done," unconditionally.  This is the 
highest-performance thing you could possibly do - but depending on your system 
workload, it could put you at risk for data loss.


 . Mirroring SSDs is only helpful if one SSD fails at the time of a power
 failure.  This leave several unanswered questions.  How good is ZFS at
 detecting that an SSD is no longer a reliable write target?   The chance of
 silent data corruption is well documented about spinning disks.  What chance
 of data corruption does this introduce with up to 10 seconds of data written
 on SSD.  Does ZFS read the ZIL during a scrub to determine if our SSD is
 returning what we write to it?

Not just power loss -- any ungraceful crash.  

ZFS doesn't have any way to scrub ZIL devices, so it's not very good at 
detecting failed ZIL devices.  There is definitely the possibility for an SSD 
to enter a failure mode where you write to it, it doesn't complain, but you 
wouldn't be able to read it back if you tried.  Also, upon ungraceful crash, 
even if you try to read that data, and fail to get it back, there's no way to 
know that you should have expected something.  So you still don't detect the 
failure.

If you want to maintain your SSD periodically, you should do something like:  
Remove it as a ZIL device, create a new pool with just this disk in it, write a 
bunch of random data to the new junk pool, scrub the pool, then destroy the 
junk pool and return it as a ZIL device to the main pool.  This does not 
guarantee anything - but then - nothing anywhere guarantees anything.  This is 
a good practice, and it definitely puts you into a territory of reliability 
better than the competing alternatives.
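
A rough command-level sketch of that maintenance cycle, assuming a pool named 
tank and a log SSD named c1t5d0 (both names hypothetical):

   zpool remove tank c1t5d0          # pull the SSD out of the pool as a log device
   zpool create junk c1t5d0          # temporary scratch pool on the SSD alone
   dd if=/dev/urandom of=/junk/fill bs=1024k count=4096   # write ~4GB of random data
   zpool scrub junk                  # read it all back and verify checksums
   zpool status -v junk              # any read/checksum errors show up here
   zpool destroy junk
   zpool add tank log c1t5d0         # return the SSD as a slog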


 . Zpool versions 19 and higher should be able to survive a ZIL failure only
 loosing the uncommitted data.   However, I haven't seen good enough
 information that I would necessarily trust this yet.

That was a very long time ago.  (What, 2-3 years?)  It's very solid now.


 . Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
 SSDs.   I'm not sure if that is current, but I can't find any reports of 
 better
 performance.   I would suspect that DDR drive or Zeus RAM as ZIL would push
 past this.

Whenever I measure the sustainable throughput of an SSD, HDD, DDRdrive, or 
anything else, very few devices can actually sustain faster than 1Gb/s, whether 
for use as a ZIL or anything else.  Published specs are often higher, but not 
realistic.

If you are ZIL bandwidth limited, you should consider tuning the size of stuff 
that goes to ZIL.



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Andrew Gabriel

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Schweiss, Chip

How can I determine for sure that my ZIL is my bottleneck?  If it is the
bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.


Temporarily set sync=disabled
Or, depending on your application, leave it that way permanently.  I know, for 
the work I do, most systems I support at most locations have sync=disabled.  It 
all depends on the workload.


Noting of course that this means that in the case of an unexpected system 
outage or loss of connectivity to the disks, synchronous writes since the last 
txg commit will be lost, even though the applications will believe they are 
secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's 
been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with 
fresh data and those which can look closely at the data to see how far they got 
before being rudely interrupted, but not for those which rely on the Posix 
semantics of synchronous writes/syncs meaning data is secured on non-volatile 
storage when the function returns.

--
Andrew


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Neil Perrin

On 10/04/12 05:30, Schweiss, Chip wrote:

 Thanks for all the input.  It seems information on the performance of the
 ZIL is sparse and scattered.  I've spent significant time researching this
 the past day.  I'll summarize what I've found.  Please correct me if I'm
 wrong.

 - The ZIL can have any number of SSDs attached, either mirrored or
 individually.  ZFS will stripe across these in a raid0 or raid10 fashion
 depending on how you configure.

The ZIL code chains blocks together and these are allocated round robin
among slogs or, if they don't exist, then among the main pool devices.

 - To determine the true maximum streaming performance of the ZIL, setting
 sync=disabled will only use the in-RAM ZIL.  This gives up power protection
 for synchronous writes.

There is no RAM ZIL. If sync=disabled then all writes are asynchronous and
are written as part of the periodic ZFS transaction group (txg) commit that
occurs every 5 seconds.

 - Many SSDs do not help protect against power failure because they have
 their own RAM cache for writes.  This effectively makes the SSD useless for
 this purpose and potentially introduces a false sense of security.  (These
 SSDs are fine for L2ARC.)

The ZIL code issues a write cache flush to all devices it has written before
returning from the system call. I've heard that not all devices obey the
flush, but we consider them broken hardware. I don't have a list to avoid.

 - Mirroring SSDs is only helpful if one SSD fails at the time of a power
 failure.  This leaves several unanswered questions.  How good is ZFS at
 detecting that an SSD is no longer a reliable write target?  The chance of
 silent data corruption is well documented for spinning disks.  What chance
 of data corruption does this introduce with up to 10 seconds of data
 written on SSD?  Does ZFS read the ZIL during a scrub to determine if our
 SSD is returning what we write to it?

If the ZIL code gets a block write failure it will force the txg to commit
before returning. It will depend on the drivers and IO subsystem as to how
hard it tries to write the block.

 - Zpool versions 19 and higher should be able to survive a ZIL failure,
 only losing the uncommitted data.  However, I haven't seen good enough
 information that I would necessarily trust this yet.

This has been available for quite a while and I haven't heard of any bugs
in this area.

 - Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
 SSDs.  I'm not sure if that is current, but I can't find any reports of
 better performance.  I would suspect that DDR drive or Zeus RAM as ZIL
 would push past this.

1GB/s seems very high, but I don't have any numbers to share.

 Anyone care to post their performance numbers on current hardware with E5
 processors, and RAM-based ZIL solutions?

 Thanks to everyone who has responded and contacted me directly on this
 issue.

 -Chip

 On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel
 andrew.gabr...@cucumber.demon.co.uk wrote:

 Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip

 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the
 ZIL to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

 Temporarily set sync=disabled
 Or, depending on your application, leave it that way permanently.  I know,
 for the work I do, most systems I support at most locations have
 sync=disabled.  It all depends on the workload.

 Noting of course that this means that in the case of an unexpected system
 outage or loss of connectivity to the disks, synchronous writes since the
 last txg commit will be lost, even though the applications will believe
 they are

Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Richard Elling
Thanks Neil, we always appreciate your comments on ZIL implementation.
One additional comment below...

On Oct 4, 2012, at 8:31 AM, Neil Perrin neil.per...@oracle.com wrote:

 On 10/04/12 05:30, Schweiss, Chip wrote:
 
 Thanks for all the input.  It seems information on the performance of the 
 ZIL is sparse and scattered.   I've spent significant time researching this 
 the past day.  I'll summarize what I've found.   Please correct me if I'm 
 wrong.
 The ZIL can have any number of SSDs attached either mirror or individually.  
  ZFS will stripe across these in a raid0 or raid10 fashion depending on how 
 you configure.
 
 The ZIL code chains blocks together and these are allocated round robin among 
 slogs or
 if they don't exist then the main pool devices.
 
 To determine the true maximum streaming performance of the ZIL setting 
 sync=disabled will only use the in RAM ZIL.   This gives up power protection 
 to synchronous writes.
 
 There is no RAM ZIL. If sync=disabled then all writes are asynchronous and 
 are written
 as part of the periodic ZFS transaction group (txg) commit that occurs every 
 5 seconds.
 
 Many SSDs do not help protect against power failure because they have their 
 own ram cache for writes.  This effectively makes the SSD useless for this 
 purpose and potentially introduces a false sense of security.  (These SSDs 
 are fine for L2ARC)
 
 The ZIL code issues a write cache flush to all devices it has written before 
 returning
 from the system call. I've heard, that not all devices obey the flush but we 
 consider them
 as broken hardware. I don't have a list to avoid.
 
 
 Mirroring SSDs is only helpful if one SSD fails at the time of a power 
 failure.  This leave several unanswered questions.  How good is ZFS at 
 detecting that an SSD is no longer a reliable write target?   The chance of 
 silent data corruption is well documented about spinning disks.  What chance 
 of data corruption does this introduce with up to 10 seconds of data written 
 on SSD.  Does ZFS read the ZIL during a scrub to determine if our SSD is 
 returning what we write to it?
 
 If the ZIL code gets a block write failure it will force the txg to commit 
 before returning.
 It will depend on the drivers and IO subsystem as to how hard it tries to 
 write the block.
 
 
 Zpool versions 19 and higher should be able to survive a ZIL failure only 
 loosing the uncommitted data.   However, I haven't seen good enough 
 information that I would necessarily trust this yet. 
 
 This has been available for quite a while and I haven't heard of any bugs in 
 this area.
 
 Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs.   
 I'm not sure if that is current, but I can't find any reports of better 
 performance.   I would suspect that DDR drive or Zeus RAM as ZIL would push 
 past this.
 
 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device.
For example, if you have a device that can achieve 700 MB/sec, but a workload
generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it
should be immediately obvious that the slog needs to be striped. Empirically,
this is also easy to measure.
 -- richard

 
   
 Anyone care to post their performance numbers on current hardware with E5 
 processors, and ram based ZIL solutions?  
 
 Thanks to everyone who has responded and contacted me directly on this issue.
 
 -Chip
 On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel 
 andrew.gabr...@cucumber.demon.co.uk wrote:
 Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL 
 to
 make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.
 
 Temporarily set sync=disabled
 Or, depending on your application, leave it that way permanently.  I know, 
 for the work I do, most systems I support at most locations have 
 sync=disabled.  It all depends on the workload.
 
 Noting of course that this means that in the case of an unexpected system 
 outage or loss of connectivity to the disks, synchronous writes since the 
 last txg commit will be lost, even though the applications will believe they 
 are secured to disk. (ZFS filesystem won't be corrupted, but it will look 
 like it's been wound back by up to 30 seconds when you reboot.)
 
 This is fine for some workloads, such as those where you would start again 
 with fresh data and those which can look closely at the data to see how far 
 they got before being rudely interrupted, but not for those which rely on 
 the Posix semantics of synchronous writes/syncs meaning data is secured on 
 non-volatile storage when the function returns.
 
 -- 
 Andrew
 
 
 
 

Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Schweiss, Chip
Again thanks for the input and clarifications.

I would like to clarify the numbers I was talking about with the ZIL
performance specs I've seen discussed on other forums.   Right now
I'm getting streaming performance of sync writes at about 1 Gbit/s.   My
target is closer to 10 Gbit/s.   If I get to build this system, it will
house a decent-size VMware NFS storage with 200+ VMs, which will be dual
connected via 10GbE.   This is all medical imaging research.  We move data
around by the TB and fast streaming is imperative.

The system I've been testing with is 10GbE connected and I have about 50
VMs running very happily, and haven't yet found my random I/O limit.
However, every time I storage vMotion a handful of additional VMs, the ZIL
seems to max out its write speed to the SSDs and random I/O also
suffers.   Without the SSD ZIL, random I/O is very poor.   I will be doing
some testing with sync=disabled tomorrow and see how things perform.

If anyone can testify to a ZIL device (or devices) that can keep up with
10GbE or more of streaming synchronous writes, please let me know.

-Chip



On Thu, Oct 4, 2012 at 1:33 PM, Richard Elling richard.ell...@gmail.com wrote:


 This has been available for quite a while and I haven't heard of any bugs
 in this area.


- Several threads seem to suggest a ZIL throughput limit of 1Gb/s with
SSDs.   I'm not sure if that is current, but I can't find any reports of
better performance.   I would suspect that DDR drive or Zeus RAM as ZIL
would push past this.


 1GB/s seems very high, but I don't have any numbers to share.


 It is not unusual for workloads to exceed the performance of a single
 device.
 For example, if you have a device that can achieve 700 MB/sec, but a
 workload
 generated by lots of clients accessing the server via 10GbE (1 GB/sec),
 then it
 should be immediately obvious that the slog needs to be striped.
 Empirically,
 this is also easy to measure.
  -- richard




Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 If I get to build it this system, it will house a decent size VMware
 NFS storage w/ 200+ VMs, which will be dual connected via 10Gbe.   This is all
 medical imaging research.  We move data around by the TB and fast
 streaming is imperative.

This might not carry over to VMware, iSCSI vs NFS.  But with VirtualBox, using 
a local file versus using a local zvol, I have found the zvol is much faster 
for the guest OS.  Also, by default the zvol will have smarter reservations 
(refreservation), which seems to be a good thing.
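
For example (a sketch only; pool, dataset, and size are made up), backing a 
guest disk with a zvol instead of a file:

   zfs create -p -V 40G tank/vm/guest1-disk0     # zvol; gets a matching refreservation by default
   zfs get volsize,refreservation tank/vm/guest1-disk0
   # VirtualBox can then attach it as a raw disk, e.g. via
   # "VBoxManage internalcommands createrawvmdk" pointing at
   # /dev/zvol/rdsk/tank/vm/guest1-disk0, if memory serves.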



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Neil Perrin
 
 The ZIL code chains blocks together and these are allocated round robin
 among slogs or
 if they don't exist then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more 
bandwidth by adding multiple slog devices?



Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Richard Elling

On Oct 4, 2012, at 1:33 PM, Schweiss, Chip c...@innovates.com wrote:

 Again thanks for the input and clarifications.
 
 I would like to clarify the numbers I was talking about with ZiL performance 
 specs I was seeing talked about on other forums.   Right now I'm getting 
 streaming performance of sync writes at about 1 Gbit/S.   My target is closer 
 to 10Gbit/S.   If I get to build it this system, it will house a decent size 
 VMware NFS storage w/ 200+ VMs, which will be dual connected via 10Gbe.   
 This is all medical imaging research.  We move data around by the TB and fast 
 streaming is imperative.  
 
 On the system I've been testing with is 10Gbe connected and I have about 50 
 VMs running very happily, and haven't yet found my random I/O limit. However 
 every time, I storage vMotion a handful of additional VMs, the ZIL seems to 
 max out it's writing speed to the SSDs and random I/O also suffers.   With 
 out the SSD ZIL, random I/O is very poor.   I will be doing some testing with 
 sync=off, tomorrow and see how things perform.
 
 If anyone can testify to a ZIL device(s) that can keep up with 10GBe or more 
 streaming synchronous writes please let me know.  

Quick datapoint: with qty 3 ZeusRAMs as a striped slog, we could push
1.3 GBytes/sec of storage vMotion on a relatively modest system. To sustain
that sort of thing often requires full system-level tuning and proper
systems engineering design. Fortunately, people tend not to do storage
vMotion on a continuous basis.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Neil Perrin

On 10/04/12 15:59, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

The ZIL code chains blocks together and these are allocated round robin
among slogs or
if they don't exist then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more 
bandwidth by adding multiple slog devices?


In general - yes, but it really depends. Multiple synchronous writes of any size
across multiple file systems will fan out across the log devices. That is
because there is a separate independent log chain for each file system.

Also large synchronous writes (eg 1MB) within a specific file system will be 
spread out.
The ZIL code will try to allocate a block to hold all the records it needs to
commit up to the largest block size - which currently for you should be 128KB.
Anything larger will allocate a new block - on a different device if there are
multiple devices.

However, lots of small synchronous writes to the same file system might not
use more than one 128K block, and so would not benefit from multiple slog devices.

Neil.
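
By way of illustration, a sketch of adding a second mirrored slog and then 
watching the fan-out, assuming a pool named tank and hypothetical device names:

   zpool add tank log mirror c2t0d0 c2t1d0   # second slog mirror alongside the existing one
   zpool iostat -v tank 1                    # the log section shows per-slog write bandwidth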




[zfs-discuss] Making ZIL faster

2012-10-03 Thread Schweiss, Chip
I'm in the planning stages of a rather large ZFS system to house
approximately 1 PB of data.

I have only one system with SSDs for L2ARC and ZIL.  The ZIL seems to be
the bottleneck for large bursts of data being written.  I can't confirm
this for sure, but when I throw enough data at my storage pool that the
write latency starts rising, the ZIL write speed hangs close to the max
sustained throughput I've measured on the SSD (~200 MB/s).

When empty, without L2ARC or ZIL, the pool was tested with Bonnie++ and showed
~1300MB/s serial read and ~800MB/s serial write speed.

How can I determine for sure that my ZIL is my bottleneck?  If it is the
bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.
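
One quick way to watch for this while a burst is in flight (a sketch, assuming 
the pool is named tank):

   zpool iostat -v tank 1    # if the log device is pinned near its ~200 MB/s ceiling
                             # while the data vdevs stay comparatively idle,
                             # the slog is the limiter
   iostat -xnz 1             # per-device view: look for the SSD saturated
                             # (high %b, rising asvc_t)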

Thanks for any input,
-Chip


Re: [zfs-discuss] Making ZIL faster

2012-10-03 Thread Timothy Coalson
I found something similar happening when writing over NFS (at significantly
lower throughput than available on the system directly): effectively all
data, even asynchronous writes, was being written to the ZIL.  I eventually
traced this (with help from Richard Elling and others on this list) at least
partially to the Linux NFS client issuing commit requests before ZFS wanted
to write the asynchronous data to a txg.  I tried fiddling with
zfs_write_limit_override to get more data onto normal vdevs faster, but this
reduced performance (perhaps setting a tunable to make ZFS not throttle
writes while hitting the write limit could fix that), and it didn't cause it
to go significantly easier on the ZIL devices.  I decided to live with the
default behavior, since my main bottleneck is ethernet anyway, and the
projected lifespan of the ZIL devices was fairly large due to our workload.

I did find that setting logbias=throughput on a zfs filesystem caused it to
act as though the ZIL devices weren't there, which actually reduced commit
times under continuous streaming writes.  That was mostly due to having more
throughput for the same amount of data to commit, in large chunks, but the
zilstat script also reported less writing to the ZIL blocks (which are
allocated from normal vdevs without a ZIL device, or with
logbias=throughput) under this condition, so perhaps there is more to the
story.  So if you have different workloads for different datasets, this
could help (since it isn't a poolwide setting).  Obviously, small
synchronous writes to that zfs filesystem will take a large hit from this
setting.
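
Since logbias is per-dataset, you can steer only the streaming-heavy 
filesystems away from the slog; for example (dataset names hypothetical):

   zfs set logbias=throughput tank/scratch   # big commits write their blocks to the main pool
   zfs set logbias=latency tank/vmstore      # small sync writes keep using the slog (the default)
   zfs get -r logbias tank                   # confirm what each dataset is doing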

It would be nice if there was a feature in ZFS that could direct small
commits to ZIL blocks on log devices, but behave like logbias=throughput
for large commits.  It would probably need manual tuning, but it would
treat SSD log devices more gently, and increase performance for large
contiguous writes.

If you can't configure ZFS to write less data to the ZIL, I think a RAM
based ZIL device would be a good way to get throughput up higher (and less
worries about flash endurance, etc).

Tim

On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip c...@innovates.com wrote:

 I'm in the planing stages of a rather larger ZFS system to house
 approximately 1 PB of data.

 I have only one system with SSDs for L2ARC and ZIL,  The ZIL seems to be
 the bottle neck for large bursts of data being written.I can't confirm
 this for sure, but the when throwing enough data at my storage pool and the
 write latency starts rising, the ZIL write speed hangs close the max
 sustained throughput I've measured on the SSD (~200 MB/s).

 The pool when empty w/o L2ARC or ZIL it was tested with Bonnie++ and
 showed ~1300MB/s serial read and ~800MB/s serial write speed.

 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
 to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

 Thanks for any input,
 -Chip







Re: [zfs-discuss] Making ZIL faster

2012-10-03 Thread Timothy Coalson
To answer your questions more directly, zilstat is what I used to check
what the ZIL was doing:

http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
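
A typical invocation looks something like the following (the arguments are 
from memory, so check the script's usage text):

   # sample ZIL activity once per second, ten samples
   ./zilstat.ksh 1 10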

While I have added a mirrored log device, I haven't tried adding multiple
sets of mirrored log devices, but I think it should work.  I believe that a
failed unmirrored log device is only a problem if the pool is ungracefully
closed before ZFS notices that the log device failed (i.e., simultaneous
power failure and log device failure), so mirroring them may not be
required.

Tim

On Wed, Oct 3, 2012 at 2:54 PM, Timothy Coalson tsc...@mst.edu wrote:

 I found something similar happening when writing over NFS (at
 significantly lower throughput than available on the system directly),
 specifically that effectively all data, even asynchronous writes, were
 being written to the ZIL, which I eventually traced (with help from Richard
 Elling and others on this list) at least partially to the linux NFS client
 issuing commit requests before ZFS wanted to write the asynchronous data to
 a txg.  I tried fiddling with zfs_write_limit_override to get more data
 onto normal vdevs faster, but this reduced performance (perhaps setting a
 tunable to make ZFS not throttle writes while hitting the write limit could
 fix that), and didn't cause it to go significantly easier on the ZIL
 devices.  I decided to live with the default behavior, since my main
 bottleneck is ethernet anyway, and the projected lifespan of the ZIL
 devices was fairly large due to our workload.

 I did find that setting logbias=throughput on a zfs filesystem caused it
 to act as though the ZIL devices weren't there, which actually reduced
 commit times under continuous streaming writes (mostly due to having more
 throughput for the same amount of data to commit, in large chunks, but the
 zilstat script also reported less writing to the ZIL blocks (which are
 allocated from normal vdevs without a ZIL device, or with
 logbias=throughput) under this condition, so perhaps there is more to the
 story), so if you have different workloads for different datasets, this
 could help (since it isn't a poolwide setting).  Obviously, small
 synchronous writes to that zfs filesystem will take a large hit from this
 setting.

 It would be nice if there was a feature in ZFS that could direct small
 commits to ZIL blocks on log devices, but behave like logbias=throughput
 for large commits.  It would probably need manual tuning, but it would
 treat SSD log devices more gently, and increase performance for large
 contiguous writes.

 If you can't configure ZFS to write less data to the ZIL, I think a RAM
 based ZIL device would be a good way to get throughput up higher (and less
 worries about flash endurance, etc).

 Tim

 On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip c...@innovates.com wrote:

 I'm in the planing stages of a rather larger ZFS system to house
 approximately 1 PB of data.

 I have only one system with SSDs for L2ARC and ZIL,  The ZIL seems to be
 the bottle neck for large bursts of data being written.I can't confirm
 this for sure, but the when throwing enough data at my storage pool and the
 write latency starts rising, the ZIL write speed hangs close the max
 sustained throughput I've measured on the SSD (~200 MB/s).

 The pool when empty w/o L2ARC or ZIL it was tested with Bonnie++ and
 showed ~1300MB/s serial read and ~800MB/s serial write speed.

 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
 to make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

 Thanks for any input,
 -Chip








Re: [zfs-discuss] Making ZIL faster

2012-10-03 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Schweiss, Chip
 
 How can I determine for sure that my ZIL is my bottleneck?  If it is the
 bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
 make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.

Temporarily set sync=disabled
Or, depending on your application, leave it that way permanently.  I know, for 
the work I do, most systems I support at most locations have sync=disabled.  It 
all depends on the workload.
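
A minimal way to run that test and put things back afterwards, assuming the 
dataset under test is tank/test (name hypothetical):

   zfs get sync tank/test             # note the current setting
   zfs set sync=disabled tank/test    # benchmark with the ZIL taken out of the picture
   # ... run the workload / Bonnie++ against tank/test here ...
   zfs set sync=standard tank/test    # restore normal synchronous-write semantics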
