On Fri, Jan 19, 2018 at 3:29 PM, Jim Wiggs <[email protected]> wrote:
>
> On Fri, Jan 19, 2018 at 1:03 PM, Mike Gerdts <[email protected]> wrote:
>
>> On Fri, Jan 19, 2018 at 2:53 PM, Jim Wiggs <[email protected]> wrote:
>>
>>> On Fri, Jan 19, 2018 at 12:17 PM, Marsell K <[email protected]> wrote:
>>>
>>>> > If that weren't the case, I wouldn't have done what I was doing, because I'd be reducing the lifetime of my ZIL devices by 90% or more.
>>>>
>>>> I don't follow. You will reduce your slog's SSD endurance by using it for other things too, which is apparently what you're trying to do?
>>>
>>> Yes, but the "other things" I want to use it for, L2ARC, are strongly read-intensive. Only P/E cycles induce wear on an SSD, not repeated reads of data that's already in the device. So, the vast majority of the "wear" on an SSD that's split between ZIL and L2ARC is going to come from the ZIL side of things. I haven't done the numbers, but I'd strongly suspect that ZIL would generate at least an order of magnitude more writes to the device than L2ARC, over the same time period.
>>>
>>>> In any case, I will merely point out that Joyent has never used mirrored SSDs for the slog, albeit we made sure to use a quality SSD. You can see a series of Joyent BOMs here: https://eng.joyent.com/manufacturing/bom.html
>>>
>>> So, what happens to that last ~5 seconds of data that you *thought* was safely committed to your zpool if your non-mirrored quality SSD does in fact glitch out and die, never to be seen again?
>>
>> So long as the system doesn't immediately crash, it still gets written to the data devices (for lack of better words) through the normal write path. When writes go to a log device ahead of writes to the data devices, the write to the data device comes from the same kernel buffer as was used for writing to the log device.
>
> OK, now *this* makes sense. Given this little tidbit of information, I can see how the only failure mode that could actually cost you data would be: *SSD fries*, immediately (as in: within microseconds) followed by *complete power failure* so that ZFS never gets a chance to recognize the log's failure and re-write the data -- which is still resident in the kernel write buffer -- to the spinning rust directly. If the server is properly protected from power surges by a good UPS/line conditioner setup, the odds of that happening are pretty. damned. low.
>
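To make the quoted point concrete -- the buffer handed to the slog is the same in-kernel buffer that later feeds the normal write to the data disks -- here is a rough conceptual sketch in Python. It is not actual ZFS code; every class and method name in it is invented purely for illustration.

# Conceptual sketch only -- NOT actual ZFS code.  The point it shows:
# a sync write stays in a kernel buffer until the transaction group
# (txg) holding it has been committed to the data vdevs, so losing the
# slog alone (while the system stays up) loses no acknowledged data.

class Device:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.blocks = []          # stand-in for "what reached stable storage"

    def write(self, buf):
        self.blocks.append(buf)

class Pool:
    def __init__(self, slog, data_vdevs):
        self.slog = slog          # separate log device (may be None)
        self.data_vdevs = data_vdevs
        self.pending = []         # in-kernel buffers awaiting txg commit

    def sync_write(self, buf):
        self.pending.append(buf)  # the data stays in kernel memory
        zil_target = self.slog if self.slog and self.slog.healthy else self.data_vdevs
        zil_target.write(buf)     # ZIL record: slog if present, else data disks
        return "ok"               # the caller's fsync()/O_SYNC write returns

    def txg_commit(self):
        # Runs roughly every txg_timeout seconds; the SAME buffers feed
        # the normal write path to the data vdevs.
        for buf in self.pending:
            self.data_vdevs.write(buf)
        self.pending.clear()

pool = Pool(Device("slog-ssd"), Device("mirror-0"))
pool.sync_write(b"acknowledged application data")
pool.slog.healthy = False         # the slog dies after the write returned
pool.txg_commit()                 # the data still lands on the data vdevs
assert b"acknowledged application data" in pool.data_vdevs.blocks

The only window in which acknowledged data exists solely on the slog is between the sync write returning and the next txg commit, which is why the simultaneous slog-failure-plus-power-loss scenario described above is the one that actually hurts.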
The separate log devices are write-only devices when things go well. They are read as the system boots to see what transactions had not yet made it to the data disks. If data were committed to the slog and the write returned success to the caller, then the slog failed, several seconds of data (N * txg_timeout, with N <= 3 and txg_timeout typically 5 seconds) could be lost should the kernel also be unable to write it to the data disks.

The amount of data that can be written in 3 transaction groups is also a good estimate for how large your slog needs to be. If it is a 12 Gbit/s SAS device, the device needs to be about 18 GB (1.2 GB/s after 8b/10b encoding * 3 txgs * 5 seconds per txg) to be sure it never overflows to the log structures on the data disks. By going larger than that, you extend the life of the slog because the writes will be spread across more cells.

My understanding, based on extensive experimentation and discussions with ZFS engineers, is that all sync writes go to an intent log as quickly as possible. If a log vdev (slog) is present and operational, they go there; otherwise, they go to log structures on the data disks. Regardless of whether you have a log vdev or not, spa_sync_thread() bunches the data that has not yet been written into transaction groups and writes them out to the data disks.

My testing showed that for purely synchronous writes (via iSCSI) without a slog, the amount of data written to the data disks in a pool that didn't use raidz was approximately mirrors * network_read_bytes. Adding a log vdev split the writes rather evenly across the log device and the data vdevs.

One greatly overlooked impact of having a slog is that it not only reduces latency by writing to a write-optimized device, but it also reduces the total number of writes that need to go to the data disks. This reduction of writes to the data disks can be a great help when the data disks are approaching IOPS or throughput limits. Anecdotes suggest that in some cases a slog on spinning rust can provide measurable improvements for some synchronous write-intensive workloads. This bolsters the claim earlier in this thread that you don't need to have the best write-optimized SSD to benefit greatly from a log device. The important thing seems to be that it is at least as fast as the aggregate speed of the data vdevs.

In the interest of "brevity" I've left out reads and how their contention for IOPS complicates things further.

Mike
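For reference, the slog sizing arithmetic earlier in this message works out as follows in a few lines of Python. The 12 Gbit/s SAS link, the 8b/10b encoding overhead, the 3 in-flight txgs, and the 5 second txg_timeout are simply the assumptions stated above, not measured values.

# Back-of-the-envelope slog sizing using the figures in the message.
link_gbit_per_s = 12                # 12 Gbit/s SAS link
encoding_efficiency = 8 / 10        # 8b/10b: 8 data bits per 10 line bits
payload_gb_per_s = link_gbit_per_s * encoding_efficiency / 8   # bits -> bytes, ~1.2 GB/s
txgs_in_flight = 3                  # open, quiescing, syncing
txg_timeout_s = 5                   # typical txg commit interval

slog_gb = payload_gb_per_s * txgs_in_flight * txg_timeout_s
print(f"{payload_gb_per_s:.1f} GB/s * {txgs_in_flight} txgs * "
      f"{txg_timeout_s} s = {slog_gb:.0f} GB")   # prints: 1.2 GB/s * 3 txgs * 5 s = 18 GB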
