On Fri, Jan 19, 2018 at 3:29 PM, Jim Wiggs <[email protected]> wrote:
>
> On Fri, Jan 19, 2018 at 1:03 PM, Mike Gerdts <[email protected]>
> wrote:
>
>> On Fri, Jan 19, 2018 at 2:53 PM, Jim Wiggs <[email protected]> wrote:
>>
>>>
>>>
>>> On Fri, Jan 19, 2018 at 12:17 PM, Marsell K <[email protected]> wrote:
>>>
>>>> > If that weren't the case, I wouldn't have done what I was doing,
>>>> because I'd be reducing the lifetime of my ZIL devices by 90% or more.
>>>>
>>>> I don't follow. You will reduce your slog's SSD endurance by using it
>>>> for other things too, which is apparently what you're trying to do?
>>>>
>>>>
>>> Yes, but the "other things" I want to use it for, L2ARC, are strongly
>>> read-intensive.  Only P/E cycles induce wear on an SSD, not repeated reads
>>> of data that's already in the device.  So, the vast majority of the "wear"
>>> on an SSD that's split between ZIL and L2ARC is going to come from the ZIL
>>> side of things.  I haven't done the numbers, but I'd strongly suspect that
>>> ZIL would generate at least an order of magnitude more writes to the device
>>> than L2ARC, over the same time period.
>>>
>>>> In any case, I will merely point out that Joyent has never used mirrored
>>>> SSDs for the slog, albeit we made sure to use a quality SSD. You can see a
>>>> series of Joyent BOMs here: https://eng.joyent.com/manufacturing/bom.html
>>>>
>>>>
>>> So, what happens to that last ~5 seconds of data that you *thought* was
>>> safely committed to your zpool if your non-mirrored quality SSD does in
>>> fact glitch out and die, never to be seen again?
>>>
>>>
>> So long as the system doesn't immediately crash, it still gets written to
>> the data devices (for lack of better words) through the normal write path.
>> When writes go to a log device ahead of writes to the data devices, the
>> write to the data device comes from the same kernel buffer as was used for
>> writing to the log device.
>>
>
> OK, now *this* makes sense.  Given this little tidbit of information, I
> can see how the only failure mode that could actually cost you data would
> be: *SSD fries*, immediately (as in: within microseconds) followed by
> *complete power failure* so that ZFS never gets a chance to recognize the
> log's failure and re-write the data -- which is still resident in the
> kernel write buffer -- to the spinning rust directly.  If the server is
> properly protected from power surges by a good UPS/line conditioner setup,
> the odds of that happening are pretty. damned. low.
>

The separate log devices are effectively write-only when things go well.  They
are read as the system boots, to find any transactions that had not yet made
it to the data disks.  If data were committed to the slog and the write
returned success to the caller, and the slog then failed, up to several
seconds of data (N * txg_timeout, with N <= 3 and txg_timeout typically 5
seconds, so roughly 15 seconds at most) could be lost should the kernel also
be unable to write it to the data disks.

The amount of data that can be written in three transaction groups is also a
good estimate of how large your slog needs to be.  For a device on a 12 Gbit/s
SAS link, that works out to about 18 GB (1.2 GB/s after 8b/10b encoding * 3
txgs * 5 seconds per txg) to be sure it never overflows to the log structures
on the data disks.  Going larger than that extends the life of the slog,
because the writes are spread across more cells.
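
As a back-of-the-envelope check, here is that sizing math in a few lines of
Python (the link speed, encoding overhead, and txg parameters are the same
assumptions as above, not values read from a live system):

    # Rough slog sizing sketch; all inputs are assumptions, not probed values.
    link_bits_per_s = 12e9                             # 12 Gbit/s SAS line rate
    data_bytes_per_s = link_bits_per_s * 8 / 10 / 8    # 8b/10b -> ~1.2 GB/s
    txg_timeout_s = 5                                  # typical txg_timeout
    open_txgs = 3                                      # up to 3 txgs outstanding
    slog_bytes = data_bytes_per_s * txg_timeout_s * open_txgs
    print(slog_bytes / 1e9)                            # ~18 GB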

My understanding, based on extensive experimentation and discussions with
ZFS engineers, is that all sync writes go to an intent log as quickly as
possible.  If a log vdev (slog) is present and operational, the write goes
there; otherwise, it goes to log structures on the data disks.  Regardless of
whether you have a log vdev or not, the spa_sync_thread() bundles the data
that has not yet been written into transaction groups and writes them out to
the data disks.
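
That mental model, sketched as toy Python (the class and method names are
invented for illustration; this is not actual ZFS code, just the flow I
described above):

    # Toy model of the intent-log flow described above; not real ZFS code.
    class ToyPool:
        def __init__(self, has_slog):
            self.has_slog = has_slog
            self.slog_writes = []       # stands in for the log vdev
            self.zil_on_data = []       # stands in for ZIL blocks on data disks
            self.dirty = []             # data not yet part of a synced txg
            self.data_vdev_writes = []  # stands in for the data vdevs

        def handle_sync_write(self, buf):
            # a sync write hits an intent log as quickly as possible
            if self.has_slog:
                self.slog_writes.append(buf)
            else:
                self.zil_on_data.append(buf)
            # the same kernel buffer stays queued for the next txg
            self.dirty.append(buf)

        def sync_txg(self):
            # runs roughly every txg_timeout seconds: bundle the dirty data
            # into a transaction group and write it to the data vdevs,
            # whether or not a log vdev exists
            self.data_vdev_writes.extend(self.dirty)
            self.dirty.clear()

    pool = ToyPool(has_slog=True)
    pool.handle_sync_write(b"some sync write")
    pool.sync_txg()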

My testing showed that for purely synchronous writes (via iSCSI) without a
slog, the amount of data written to the data disks in a pool that didn't
use raidz was approximately mirrors * network_read_bytes.  Adding a log
vdev split the writes rather evenly between the log device and the data
vdevs.
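
Taken at face value, those observations translate into a rough model like
this (Python; the helper name and numbers are purely illustrative, and the
even split is just the measurement above, not a claim about the exact ZFS
write path):

    # Rough translation of the measurements above; illustrative only.
    def approx_bytes_to_data_disks(network_read_bytes, mirrors, has_slog):
        total = mirrors * network_read_bytes        # observed without a slog
        if has_slog:
            # with a log vdev the writes were split roughly evenly, so the
            # data vdevs saw only about half of that volume
            return total / 2
        return total

    # e.g. 10 GB of sync writes into a 2-way mirror
    print(approx_bytes_to_data_disks(10e9, mirrors=2, has_slog=False))  # 2e10
    print(approx_bytes_to_data_disks(10e9, mirrors=2, has_slog=True))   # 1e10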

One greatly overlooked impact of having a slog is that it not only reduces
latency by writing to a write-optimized device, but also reduces the total
number of writes that need to go to the data disks.  That reduction can be a
great help when the data disks are approaching their IOPS or throughput
limits.

Anecdotes suggest that even a slog on spinning rust can provide measurable
improvements for some synchronous write-intensive workloads.  This bolsters
the claim earlier in this thread that you don't need the best write-optimized
SSD to benefit greatly from a log device.  The important thing seems to be
that it is at least as fast as the aggregate speed of the data vdevs.

In the interest of "brevity" I've left out reads and how their contention
for iops complicates things further.

Mike