On 26/09/2009, at 1:14 AM, Ross Walker wrote:

By any chance do you have copies=2 set?

No, only 1. So the doubled data going to the slog (as reported by iostat) is still confusing me, and it is potentially doing significant harm to my performance.
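
For reference, I confirmed that with something along these lines (the dataset name is a placeholder rather than my actual pool):

        # check the copies property on the filesystems being written over NFS
        zfs get copies tank/export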

Also, try setting zfs_write_limit_override equal to the size of the
NVRAM cache (or half depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw

That’s an interesting concept. All data still appears to go via the slog device; however, under heavy load my response time for a new write is typically below 2 s (with a few outliers at about 3.5 s), and a read (a directory listing of a non-cached entry) takes about 2 s.

What will this do once it hits the limit? Will streaming writes then be sent directly into a txg and streamed to the primary storage devices? (That is what I would like to see happen.)
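
For my own notes (and anyone following along), reading the value back and making the override survive a reboot looks roughly like this; the 0x10000000 figure is just the 256 MB example from above, and the /etc/system line follows the usual tuning-guide form rather than anything I have verified myself:

        # read the current value back (prints in hex, 64-bit)
        echo zfs_write_limit_override/J | mdb -k

and, to make it persistent, in /etc/system:

        set zfs:zfs_write_limit_override = 0x10000000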

As an aside, a slog device will not be of much benefit for large
sequential writes, because those are throughput bound, not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much
higher throughput than an SSD. An example of a workload that benefits
from a slog device is ESX over NFS, which issues a COMMIT for each
block written, so it benefits from a slog, but a standard media
server will not (though an L2ARC would be beneficial).

Better workload analysis is really what it is about.


It seems that it doesn’t matter what the workload is if the NFS pipe can sustain more continuous throughput than the slog chain can support.
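
On the workload analysis point: a rough way to quantify how much sync/ZIL traffic a given workload generates would be to count zil_commit() calls with DTrace. Just a sketch, assuming the fbt probe for zil_commit is available on this build:

        # count zil_commit() calls per second as a crude proxy for sync write activity
        dtrace -n 'fbt::zil_commit:entry { @c = count(); } tick-1sec { printa(@c); clear(@c); }'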

I suppose some creative use of the logbias setting might assist in this situation and force all potentially heavy writers directly to the primary storage. This would, however, negate any benefit of having a fast, low-latency device on those filesystems for the times when it is desirable (any large batch of small writes, for example).

Is there a way to have a dynamic, automatic logbias-type setting that depends on the transaction currently presented to the server, such that a clearly large streaming write is treated as logbias=throughput while a small transaction is treated as logbias=latency? (i.e. so that NFS transactions can effectively be treated as if they were local storage, with only a minor impact on the benefits of txg scheduling).
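
For completeness, the static version of what I am describing would look something like the following (the dataset names are made up for illustration):

        # large streaming writers: bypass the slog and go straight to the pool
        zfs set logbias=throughput tank/media

        # small, latency-sensitive sync writers: keep using the slog
        zfs set logbias=latency tank/home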

On 26/09/2009, at 3:39 AM, Richard Elling wrote:

Back of the envelope math says:
        10 GbE = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
        int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
        1 GByte/sec * 30 sec = 30 GBytes

Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
or so.

At this point, enter the fusionIO cards or similar devices. Unfortunately, there does not seem to be anything on the market with memory-speed write throughput that is also supported under OpenSolaris as a slog device.

I think this is precisely what I (and anybody running a general purpose NFS server) need for a general purpose slog device.

Both of the above assume there is lots of memory in the server.
This is increasingly becoming easier to do as the memory costs
come down and you can physically fit 512 GBytes in a 4u server.
By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
NVRAM in their arrays. So Bob's recommendation of reducing the
txg commit interval below 30 seconds also has merit.  Or, to put it
another way, the dynamic sizing of the txg commit interval isn't
quite perfect yet. [Cue for Neil to chime in... :-)]

How does reducing the txg commit interval really help? Will data no longer go via the slog once it is streaming to disk, or will all data still be pushed through the slog regardless?
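
(If it does help, my understanding is that the interval in question is the zfs_txg_timeout tunable, which defaults to 30 seconds here; per the usual tuning guides it can apparently be lowered on a live system, e.g.

        # drop the txg commit interval from 30s to 10s (reverts on reboot)
        echo zfs_txg_timeout/W0t10 | mdb -kw

or persistently with "set zfs:zfs_txg_timeout = 10" in /etc/system. I have not yet tested whether that changes how much data goes via the slog.)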

For a predominantly NFS-serving workload, it really looks like the primary criterion is that the slog has to outperform the main pool in continuous write throughput, as well as offering near-instant response times. That might as well be a fast SSD (or a group of them), or 15k RPM drives with some NVRAM in front of them.
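
(The comparison itself is easy enough to watch, since zpool iostat with the verbose flag breaks out the log device separately from the main vdevs; the pool name below is a placeholder.)

        # per-vdev bandwidth, including the separate log device, sampled every 5 seconds
        zpool iostat -v tank 5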

Is there also a way to throttle synchronous writes to the slog device, much like the ZFS write throttling that is already implemented, so that there is a gap for new writers to enter when writing to the slog device? (Or is that throttling already the norm, and does it include slog writes?)

cheers,
James
