Thanks Neil, we always appreciate your comments on the ZIL implementation.
A few additional comments inline below...

On Oct 4, 2012, at 8:31 AM, Neil Perrin <> wrote:

> On 10/04/12 05:30, Schweiss, Chip wrote:
>> Thanks for all the input.  It seems information on the performance of the 
>> ZIL is sparse and scattered.  I've spent significant time researching this 
>> over the past day.  I'll summarize what I've found.  Please correct me if 
>> I'm wrong.
>> The ZIL can have any number of SSDs attached, either mirrored or individually.  
>> ZFS will stripe across these in a RAID-0 or RAID-10 fashion depending on how 
>> you configure them.
> The ZIL code chains blocks together, and these are allocated round robin among 
> the slogs or, if they don't exist, the main pool devices.
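For reference, growing the slog is just more "zpool add ... log" operations, and
the ZIL then round-robins across whatever log vdevs exist. A sketch, with made-up
pool and device names:

    # add a mirrored slog pair
    zpool add tank log mirror c4t0d0 c4t1d0
    # add a second mirrored pair; log writes are spread across both
    zpool add tank log mirror c4t2d0 c4t3d0
    # confirm the layout
    zpool status tank
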
>> To determine the true maximum streaming performance of the ZIL, setting 
>> sync=disabled will only use the in-RAM ZIL.  This gives up power protection 
>> for synchronous writes.
> There is no RAM ZIL. If sync=disabled then all writes are asynchronous and 
> are written
> as part of the periodic ZFS transaction group (txg) commit that occurs every 
> 5 seconds.
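Right, so that "maximum streaming performance" test is really just measuring
asynchronous throughput. If you want to run it anyway, something like the
following works (dataset name illustrative); remember to put sync back afterwards:

    # disable synchronous write semantics for the test dataset only
    zfs set sync=disabled tank/test
    # ... run the synchronous-write workload and measure ...
    # restore the default behaviour
    zfs set sync=standard tank/test
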
>> Many SSDs do not help protect against power failure because they have their 
>> own RAM cache for writes.  This effectively makes the SSD useless for this 
>> purpose and potentially introduces a false sense of security.  (These SSDs 
>> are fine for L2ARC.)
> The ZIL code issues a write cache flush to all devices it has written before 
> returning from the system call.  I've heard that not all devices obey the 
> flush, but we consider those broken hardware.  I don't have a list of devices 
> to avoid.
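Worth adding: devices with genuinely nonvolatile write caches (ZeusRAM, DDRdrive
and friends) can acknowledge the flush cheaply, so there is normally no reason to
defeat it. There is a global tunable to skip cache flushes entirely, but it
applies to every device in every pool, so treat this /etc/system line as a sketch
for all-NVRAM configurations only, not a recommendation:

    * skip ZFS cache flushes -- only safe if ALL devices have nonvolatile caches
    set zfs:zfs_nocacheflush = 1
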
>> Mirroring SSDs is only helpful if one SSD fails at the time of a power 
>> failure.  This leaves several unanswered questions.  How good is ZFS at 
>> detecting that an SSD is no longer a reliable write target?  The chance of 
>> silent data corruption is well documented for spinning disks.  What chance 
>> of data corruption does this introduce, with up to 10 seconds of data written 
>> to the SSD?  Does ZFS read the ZIL during a scrub to determine whether the SSD 
>> is returning what we write to it?
> If the ZIL code gets a block write failure it will force the txg to commit 
> before returning.
> It will depend on the drivers and IO subsystem as to how hard it tries to 
> write the block.
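On the detection side, a misbehaving slog shows up in the same place as any other
vdev problem: the pool status and its error counters. For example, with an
illustrative pool name:

    # check log device state and error counters
    zpool status -v tank
    # scrub the pool to verify checksums on everything readable
    zpool scrub tank
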
>> Zpool versions 19 and higher should be able to survive a ZIL failure, losing 
>> only the uncommitted data.  However, I haven't seen good enough 
>> information that I would necessarily trust this yet. 
> This has been available for quite a while and I haven't heard of any bugs in 
> this area.
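For what it's worth, pool version 19 is also where log device removal arrived, so
a dead or unwanted slog can be taken out administratively (names illustrative):

    # check which pool versions/features this build supports
    zpool upgrade -v
    # remove a log device from the pool
    zpool remove tank c4t0d0
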
>> Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs.   
>> I'm not sure if that is current, but I can't find any reports of better 
>> performance.  I would suspect that a DDRdrive or ZeusRAM as ZIL would push 
>> past this.
> 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device.
For example, if you have a device that can achieve 700 MB/sec but a workload
generated by lots of clients hitting the server over 10GbE (1 GB/sec), then it
should be immediately obvious that the slog needs to be striped. Empirically,
this is also easy to measure.
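A quick way to see whether the slog is the limiter is to watch per-vdev
throughput while the clients are pushing synchronous writes, e.g.:

    # per-vdev bandwidth and IOPS, sampled every second (pool name illustrative)
    zpool iostat -v tank 1

If the log vdevs sit pinned near their rated throughput while the network still
has headroom, stripe in another slog.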
 -- richard

>> Anyone care to post their performance numbers on current hardware with E5 
>> processors and RAM-based ZIL solutions?  
>> Thanks to everyone who has responded and contacted me directly on this issue.
>> -Chip
>> On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel <> wrote:
>> Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: [mailto:zfs-discuss-] On Behalf Of Schweiss, Chip
>> How can I determine for sure that my ZIL is my bottleneck?  If it is the
>> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
>> to make it faster?  Or should I be looking for a DDRdrive, ZeusRAM, etc.?
>> Temporarily set sync=disabled
>> Or, depending on your application, leave it that way permanently.  I know, 
>> for the work I do, most systems I support at most locations have 
>> sync=disabled.  It all depends on the workload.
>> Noting of course that this means that in the case of an unexpected system 
>> outage or loss of connectivity to the disks, synchronous writes since the 
>> last txg commit will be lost, even though the applications will believe they 
>> are secured to disk. (ZFS filesystem won't be corrupted, but it will look 
>> like it's been wound back by up to 30 seconds when you reboot.)
>> This is fine for some workloads, such as those where you would start again 
>> with fresh data and those which can look closely at the data to see how far 
>> they got before being rudely interrupted, but not for those which rely on 
>> the POSIX semantics of synchronous writes/syncs meaning data is secured on 
>> non-volatile storage when the function returns.
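As an aside, that "up to 30 seconds" window is the txg commit interval, which is
tunable and, per Neil's note above, 5 seconds on current builds. You can check
what a given box is actually using with mdb:

    # print the current txg commit interval, in seconds
    echo zfs_txg_timeout/D | mdb -k
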
>> -- 
>> Andrew