On 25/09/2009, at 2:58 AM, Richard Elling wrote:

On Sep 23, 2009, at 10:00 PM, James Lever wrote:

So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of the SLC SSD), and until the load drops all new write requests are blocked, causing a noticeable delay (observed to be up to 20s, but generally only 2-4s).

Thank you sir, can I have another?
If you add (not attach) more slogs, the workload will be spread across them. But...

My log configuration is:

        logs
          c7t2d0s0   ONLINE       0     0     0
          c7t3d0s0   OFFLINE      0     0     0

I’m going to test the now-removed SSD and see if I can get it to perform significantly worse than the first one, but my recollection from pre-production testing is that they were both equally slow and not significantly different from each other.

On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device improved performance, but did not eliminate the occasional observed pauses.

...this is not surprising, when you add a slow slog device. This is the weakest link rule.

So, in theory, even if one of the two SSDs was only slightly slower than the other, it would just appear to be more heavily affected?

Here is part of what I’m not understanding - unless one SSD is significantly worse than the other, how can the following scenario be true? Here is some iostat output from the two slog devices at 1s intervals when it gets a large series of write requests.

Idle at start.

                            extended device statistics              ---- errors ----
   r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
   0.0 1462.0    0.0 187010.2  0.0 28.6    0.0   19.6   2  83   0   0   0   0 c7t2d0
   0.0  233.0    0.0  29823.7  0.0 28.7    0.0  123.3   0  83   0   0   0   0 c7t3d0

NVRAM cache close to full. (256MB BBC)

   0.0   84.0    0.0  10622.0  0.0  3.5    0.0   41.2   0  12   0   0   0   0 c7t2d0
   0.0    0.0    0.0      0.0  0.0 35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0

   0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
   0.0  305.0    0.0  39039.3  0.0 35.0    0.0  114.7   0 100   0   0   0   0 c7t3d0

   0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
   0.0  361.0    0.0  46208.1  0.0 35.0    0.0   96.8   0 100   0   0   0   0 c7t3d0

   0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
   0.0  329.0    0.0  42114.0  0.0 35.0    0.0  106.3   0 100   0   0   0   0 c7t3d0

   0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
   0.0  317.0    0.0  40449.6  0.0 27.4    0.0   86.5   0  85   0   0   0   0 c7t3d0

   0.0    4.0    0.0    263.8  0.0  0.0    0.0    0.2   0   0   0   0   0   0 c7t2d0
   0.0    4.0    0.0    367.8  0.0  0.0    0.0    0.3   0   0   0   0   0   0 c7t3d0

What determines the size of the writes or their distribution between slog devices? It looks like ZFS decided to send a large chunk to one slog, which nearly filled the NVRAM, and then continued writing to the other one, which meant it had to go at device speed (whatever that is for the data size/write size). Is there a way to tune the writes to multiple slogs to be (for argument's sake) 10MB slices?

I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

The threshold is 32 kBytes, which is unfortunately the same as the default
NFS write size. See CR6686887
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887
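On OpenSolaris builds of that era, the 32 kByte cutoff corresponds to the zfs_immediate_write_sz kernel tunable (writes larger than it are logged indirectly, with the data block going to the main pool). A rough sketch of inspecting and persistently changing it; the 128 kB value is purely illustrative, and note that with a dedicated slog and the default logbias the tunable may not change where the data lands:

```shell
# Inspect the current threshold (in bytes) with the kernel debugger; run as root.
echo "zfs_immediate_write_sz/D" | mdb -k

# Persist a different value via /etc/system (takes effect on reboot).
# set zfs:zfs_immediate_write_sz=0x20000    # 128 kB, illustrative only
```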

If you have a slog and logbias=latency (default) then the writes go to the slog. So there is some interaction here that can affect NFS workloads in particular.
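If the build supports the logbias property, setting it to throughput on the affected filesystem should steer ZIL traffic for large synchronous writes to the main pool devices instead of the slog. A minimal sketch, assuming a hypothetical dataset name:

```shell
# Check the current setting (latency is the default).
zfs get logbias tank/nfs-export

# Bias this dataset's ZIL writes away from the dedicated slog.
zfs set logbias=throughput tank/nfs-export
```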

Interesting CR.

nfsstat -m output on one of the linux hosts (ubuntu)

Flags: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.0.17,mountvers=3,mountproto=tcp,addr=10.1.0.17

rsize and wsize are auto-tuned to 1MB. How does this affect the sync request threshold?

The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

You can change the IOP size on the client.


You’re suggesting modifying rsize/wsize?  or something else?
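For what it's worth, if rsize/wsize is what was meant, a client-side remount with a smaller write size would look roughly like this on the Linux hosts (server address matches the nfsstat output above; the mount point and 32 kB sizes are illustrative, and whether that lands writes above or below the server's threshold would need testing):

```shell
# Illustrative only: cap NFS READ/WRITE sizes at 32 kB on an existing mount.
mount -o remount,wsize=32768,rsize=32768 10.1.0.17:/export /mnt/export
```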

cheers,
James

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss