Thanks Neil, we always appreciate your comments on ZIL implementation.
One additional comment below...
On Oct 4, 2012, at 8:31 AM, Neil Perrin <neil.per...@oracle.com> wrote:
> On 10/04/12 05:30, Schweiss, Chip wrote:
>> Thanks for all the input. It seems information on the performance of the
>> ZIL is sparse and scattered. I've spent significant time researching this
>> the past day. I'll summarize what I've found. Please correct me if I'm
>> wrong.
>> The ZIL can have any number of SSDs attached, either mirrored or
>> individually. ZFS will stripe across these in a raid0 or raid10 fashion
>> depending on how you configure them.
> The ZIL code chains blocks together and allocates them round-robin among the
> slog devices, or among the main pool devices if no slogs exist.
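For anyone wanting to try this, adding more than one log device gets you the
round-robin behavior Neil describes; a minimal sketch, where the pool name
"tank" and the device names are hypothetical:

    # two independent (striped) slog devices
    zpool add tank log c4t0d0 c4t1d0
    # or a mirrored slog pair instead
    zpool add tank log mirror c4t0d0 c4t1d0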
>> To determine the true maximum streaming performance of the ZIL, setting
>> sync=disabled will only use the in-RAM ZIL. This gives up power protection
>> for synchronous writes.
> There is no RAM ZIL. If sync=disabled then all writes are asynchronous and
> are written as part of the periodic ZFS transaction group (txg) commit that
> occurs every 5 seconds.
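To illustrate, sync is a per-dataset property; a quick sketch (the dataset
name "tank/fs" is hypothetical):

    # bypass the ZIL for this dataset; all writes become asynchronous
    zfs set sync=disabled tank/fs
    # verify the current setting
    zfs get sync tank/fs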
>> Many SSDs do not help protect against power failure because they have their
>> own RAM cache for writes. This effectively makes the SSD useless for this
>> purpose and potentially introduces a false sense of security. (These SSDs
>> are fine for L2ARC.)
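As an aside, such devices can still be added as cache (L2ARC) devices, where a
volatile write cache does not risk data; for example, with a hypothetical pool
and device name:

    # add an SSD as an L2ARC cache device; its contents are discarded
    # and rebuilt after a reboot, so losing them on power failure is
    # harmless
    zpool add tank cache c5t0d0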
> The ZIL code issues a write cache flush to all devices it has written before
> returning from the system call. I've heard that not all devices obey the
> flush, but we consider them broken hardware. I don't have a list of devices
> to avoid.
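For completeness: on Solaris/illumos the flushes can be disabled globally via
the zfs_nocacheflush tunable, but that is only sane when every device in the
pool has a nonvolatile write cache (e.g. a battery-backed array); shown here
as a reference, not a recommendation:

    # /etc/system -- ONLY if all devices have nonvolatile caches
    set zfs:zfs_nocacheflush = 1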
>> Mirroring SSDs is only helpful if one SSD fails at the time of a power
>> failure. This leaves several unanswered questions. How good is ZFS at
>> detecting that an SSD is no longer a reliable write target? The chance of
>> silent data corruption is well documented for spinning disks. What chance
>> of data corruption does this introduce, with up to 10 seconds of data
>> written on SSD? Does ZFS read the ZIL during a scrub to determine if our
>> SSD is returning what we write to it?
> If the ZIL code gets a block write failure it will force the txg to commit
> before returning. How hard it tries to write the block depends on the
> drivers and IO subsystem.
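Note that a failed or degraded slog shows up in the pool status like any other
vdev, so the detection question can at least be monitored; e.g. (pool name
hypothetical):

    # the "logs" section lists slog devices and their state
    zpool status -v tank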
>> Zpool versions 19 and higher should be able to survive a ZIL failure, only
>> losing the uncommitted data. However, I haven't seen good enough
>> information that I would necessarily trust this yet.
> This has been available for quite a while and I haven't heard of any bugs in
> this area.
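One consequence of pool version 19 worth showing: log devices can be removed
(and thus replaced) online; a sketch with hypothetical names:

    # remove a slog device from the pool (supported since pool v19)
    zpool remove tank c4t0d0
    # check the pool version
    zpool get version tank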
>> Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs.
>> I'm not sure if that is current, but I can't find any reports of better
>> performance. I would suspect that a DDRdrive or ZeusRAM as ZIL would push
>> past this.
> 1GB/s seems very high, but I don't have any numbers to share.
It is not unusual for workloads to exceed the performance of a single device.
For example, if you have a device that can achieve 700 MB/sec, but the
workload generated by many clients accessing the server via 10GbE arrives at
1 GB/sec, then it should be immediately obvious that the slog needs to be
striped. Empirically, this is also easy to measure.
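For instance, per-device bandwidth, including the slog devices, can be watched
while the workload runs (pool name hypothetical):

    # show per-vdev throughput every second; the "logs" rows show how
    # much of the synchronous write load each slog is absorbing
    zpool iostat -v tank 1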
>> Anyone care to post their performance numbers on current hardware with E5
>> processors and RAM-based ZIL solutions?
>> Thanks to everyone who has responded and contacted me directly on this issue.
>> On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel
>> <andrew.gabr...@cucumber.demon.co.uk> wrote:
>> Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Schweiss, Chip
>> How can I determine for sure that my ZIL is my bottleneck? If it is the
>> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
>> to make it faster? Or should I be looking for a DDRdrive, ZeusRAM, etc.?
>> Temporarily set sync=disabled
>> Or, depending on your application, leave it that way permanently. I know,
>> for the work I do, most systems I support at most locations have
>> sync=disabled. It all depends on the workload.
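A safe way to run that experiment, again assuming a hypothetical dataset
tank/fs:

    # temporarily bypass the ZIL and measure the workload
    zfs set sync=disabled tank/fs
    # ... run the benchmark ...
    # then restore the default POSIX-compliant behavior
    zfs set sync=standard tank/fs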
>> Noting of course that this means that in the case of an unexpected system
>> outage or loss of connectivity to the disks, synchronous writes since the
>> last txg commit will be lost, even though the applications will believe they
>> are secured to disk. (ZFS filesystem won't be corrupted, but it will look
>> like it's been wound back by up to 30 seconds when you reboot.)
>> This is fine for some workloads, such as those where you would start again
>> with fresh data and those which can look closely at the data to see how far
>> they got before being rudely interrupted, but not for those which rely on
>> the POSIX semantics of synchronous writes/syncs meaning data is secured on
>> non-volatile storage when the function returns.