Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-15 Thread Richard Elling
[Phil beat me to it]
Yes, the 0s are a result of integer division in DTrace/kernel.

On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:

 Indeed they are there, shown with a 1 second interval.  So, it is the
 client's fault after all.  I'll have to see whether it is somehow
 possible to get the server to write cached data sooner (and hopefully
 asynchronously), and the client to issue commits less often.  Luckily, I
 can live with the current behavior (and the SSDs shouldn't give out
 any time soon even being used like this) if it isn't possible to
 change it.

If this is the proposed workload, then it is possible to tune the DMU to
manage commits more efficiently. In an ideal world, it does this automatically,
but the algorithms are based on a bandwidth calculation and those are not
suitable for HDD capacity planning. The efficiency goal is to do less work
more often, and there are two tunables that apply (a sketch of setting both
follows the list):

1. the txg_timeout controls the default maximum transaction group commit
interval and is set to 5 seconds on modern ZFS implementations.

2. the zfs_write_limit is a size limit for txg commit. The idea is that a txg
will be committed when its size reaches this limit, rather than waiting for
the txg_timeout. For streaming writes, this can work better than tuning the
txg_timeout.
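
For reference, a minimal sketch of how these are typically adjusted on
illumos-era systems, either live with mdb or persistently in /etc/system.
Treat the exact symbol names (zfs_txg_timeout, zfs_write_limit_override) and
the values below as assumptions to verify against your own kernel:

  # live change via mdb; assumes these symbols exist in your kernel
  echo zfs_txg_timeout/W0t1 | mdb -kw                  # commit a txg every second
  echo zfs_write_limit_override/Z0x10000000 | mdb -kw  # or once ~256 MB is dirty
  # persistent equivalent in /etc/system (takes effect at next boot):
  #   set zfs:zfs_txg_timeout = 1
  #   set zfs:zfs_write_limit_override = 0x10000000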

 -- richard

 
 Thanks for all the help,
 Tim
 
 On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman phil.har...@gmail.com wrote:
 On 14 Jun 2012, at 23:15, Timothy Coalson tsc...@mst.edu wrote:
 
 The client is using async writes, that include commits. Sync writes do not
 need commits.
 
 Are you saying nfs commit operations sent by the client aren't always
 reported by that script?
 
 They are not reported in your case because the commit rate is less than one 
 per second.
 
 DTrace is an amazing tool, but it does dictate certain coding compromises, 
 particularly when it comes to output scaling, grouping, sorting and 
 formatting.
 
 In this script the commit rate is calculated using integer division. In your 
 case the sample interval is 5 seconds, so up to 4 commits per second will be 
 reported as a big fat zero.
 
 If you use a sample interval of 1 second you should see occasional commits. 
 We know they are there because we see a non-zero commit time.
 
 

-- 

ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-15 Thread Richard Elling
On Jun 14, 2012, at 1:35 PM, Robert Milkowski wrote:

 The client is using async writes, that include commits. Sync writes do
 not need commits.
 
 What happens is that the ZFS transaction group commit occurs at more-
 or-less regular intervals, likely 5 seconds for more modern ZFS
 systems. When the commit occurs, any data that is in the ARC but not
 committed in a prior transaction group gets sent to the ZIL
 
 Are you sure? I don't think this is the case unless I misunderstood you or
 this is some recent change to Illumos.

Need to make sure we are clear here: there is time between the txg being
closed and the txg being on disk. During that period, a sync write of the
data in the closed txg is written to the ZIL.

 Whatever is being committed when the zfs txg closes goes directly to the
 pool and not to the zil. Only sync writes will go to the zil right away (and
 not always; see logbias, etc.) and to the arc, to be committed later to the
 pool when the txg closes.

In this specific case, there are separate log devices, so logbias doesn't apply.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-15 Thread Timothy Coalson
Thanks for the suggestions.  I think it would also depend on whether the nfs
server has tried to write asynchronously to the pool in the meantime, which I
am unsure how to test, other than making the txgs extremely frequent and
watching the load on the log devices.  As for the integer division giving
misleading zeros, one possible solution is to add (delay - 1) to the count
before dividing by delay, so that if there are any commits, it will show at
least 1 (or you could get fancy and use fixed-point numbers).
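
As a quick illustration of that rounding idea (not the actual nfssvrtop code,
just shell arithmetic standing in for the script's integer division):

  interval=5
  commits=3
  # ceiling division: any nonzero count in the interval reports at least 1
  echo $(( (commits + interval - 1) / interval ))   # prints 1 instead of 0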

As for very frequent txgs, I imagine this could cause more fragmentation
(more metadata written and discarded more frequently); is there a way to
estimate or test for the impact of it?  Depending on how it allocates the
metadata blocks, I suppose it could write them to the blocks recently vacated
by old metadata from the previous txg, and have almost no impact until a
snapshot is taken; is it smart enough to do this?

Tim

On Fri, Jun 15, 2012 at 10:56 AM, Richard Elling
richard.ell...@gmail.com wrote:
 [Phil beat me to it]
 Yes, the 0s are a result of integer division in DTrace/kernel.

 On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote:

 Indeed they are there, shown with a 1 second interval.  So, it is the
 client's fault after all.  I'll have to see whether it is somehow
 possible to get the server to write cached data sooner (and hopefully
 asynchronously), and the client to issue commits less often.  Luckily, I
 can live with the current behavior (and the SSDs shouldn't give out
 any time soon even being used like this) if it isn't possible to
 change it.

 If this is the proposed workload, then it is possible to tune the DMU to
 manage commits more efficiently. In an ideal world, it does this
 automatically, but the algorithms are based on a bandwidth calculation and
 those are not suitable for HDD capacity planning. The efficiency goal is to
 do less work more often, and there are two tunables that apply:

 1. the txg_timeout controls the default maximum transaction group commit
 interval and is set to 5 seconds on modern ZFS implementations.

 2. the zfs_write_limit is a size limit for txg commit. The idea is that a
 txg will be committed when its size reaches this limit, rather than waiting
 for the txg_timeout. For streaming writes, this can work better than tuning
 the txg_timeout.

  -- richard


 Thanks for all the help,
 Tim

 On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman phil.har...@gmail.com wrote:
 On 14 Jun 2012, at 23:15, Timothy Coalson tsc...@mst.edu wrote:

 The client is using async writes, that include commits. Sync writes do not
 need commits.

 Are you saying nfs commit operations sent by the client aren't always
 reported by that script?

 They are not reported in your case because the commit rate is less than one 
 per second.

 DTrace is an amazing tool, but it does dictate certain coding compromises, 
 particularly when it comes to output scaling, grouping, sorting and 
 formatting.

 In this script the commit rate is calculated using integer division. In 
 your case the sample interval is 5 seconds, so up to 4 commits per second 
 will be reported as a big fat zero.

 If you use a sample interval of 1 second you should see occasional commits. 
 We know they are there because we see a non-zero commit time.



 --

 ZFS and performance consulting
 http://www.RichardElling.com


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-15 Thread Timothy Coalson
On Fri, Jun 15, 2012 at 12:56 PM, Timothy Coalson tsc...@mst.edu wrote:
 Thanks for the suggestions.  I think it would also depend on whether
 the nfs server has tried to write asynchronously to the pool in the
 meantime, which I am unsure how to test, other than making the txgs
 extremely frequent and watching the load on the log devices.

I didn't want to reboot the main file server to test this, so I used
zilstat on the backup nfs server (which has nearly identical hardware
and configuration, but doesn't have SSDs for a separate ZIL) to see if
I could estimate the difference it would make, and the story got
stranger: it wrote far less data to the ZIL for the same copy
operation (single 8GB file):

$ sudo ./zilstat -M -l 20 -p backuppool txg
waiting for txg commit...
      txg  N-MB  N-MB/s  N-Max-Rate  B-MB  B-MB/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
  2833307     1       0           1     1       0           1     15      0       0      15
  2833308     0       0           0     0       0           0      0      0       0       0
  2833309     1       0           1     1       0           1      8      0       0       8
  2833310     0       0           0     0       0           0      4      0       0       4
  2833311     1       0           0     1       0           0      9      0       0       9
  2833312     0       0           0     0       0           0      0      0       0       0
  2833313     2       0           2     2       0           2     21      0       0      21
  2833314     7       1           7     8       1           8     63      0       0      63
  2833315     1       0           1     2       0           2     18      0       0      18
  2833316     0       0           0     0       0           0      5      0       0       5

A small sample from the server with SSD log devices doing the same operation:

$ sudo ./zilstat -M -l 20 -p mainpool txg
waiting for txg commit...
      txg  N-MB  N-MB/s  N-Max-Rate  B-MB  B-MB/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
  2808483   989     197         593  1967     393        1180  15010      0       0   15010
  2808484   599      99         208  1134     189         393   8653      0       0    8653
  2808485     0       0           0     0       0           0      0      0       0       0
  2808486   137      27         126   255      51         235   1953      0       0    1953
  2808487   460      92         460   859     171         859   6555      0       0    6555
  2808488   530      75         530  1031     147        1031   7871      0       0    7871

Setting logbias=throughput makes the server with the SSD log devices
act the same as the server without them, as far as I can tell, which I
somewhat expected.  However, I did not expect use of separate log
devices to change how often ZIL ops are performed, other than to raise
the upper limit if the device can service more IOPS.  Additionally,
nfssvrtop showed a lower value for Com_t when not using the separate
log device (2.1s with logbias=latency, 0.24s with throughput).
Copying a folder with small files and subdirectories pushes the server
to ~400 ZIL ops per txg with logbias=throughput, so it shouldn't be
the device performance making it only issue ~15 ops per txg copying a
large file without using a separate log device.  I am thinking of
transplanting one of the SSDs temporarily for testing, but I would be
interested to know the cause of this behavior.  I don't know why more
asynchronous writes seem to be making it into txgs without being
caught by an nfs commit when a separate log device isn't used.
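
For anyone reproducing the comparison, the logbias switch is just a dataset
property; the dataset name below is the one from earlier in the thread, so
adjust as needed:

  zfs get logbias mainpool/storage             # default is latency
  zfs set logbias=throughput mainpool/storage  # write ZIL blocks to the main pool instead of the slog
  zfs inherit logbias mainpool/storage         # revert to the default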

Tim


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Richard Elling
On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:

 On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
 client: ubuntu 11.10
 /etc/fstab entry: server:/mainpool/storage   /mnt/myelin nfs
 bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async   0
0
 
 nfsvers=3
 
 NAME  PROPERTY  VALUE SOURCE
 mainpool/storage  sync  standard  default
 
 sync=standard
 
 This is expected behaviour for this combination. NFS 3 semantics are
 for persistent writes at the server regardless - and mostly also 
 for NFS 4.

NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
async and sync activity, try nfssvrtop
https://github.com/richardelling/tools

 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Timothy Coalson
Thanks for the script.  Here is some sample output from 'sudo
./nfssvrtop -b 512 5' (my disks are 512B-sector emulated and the pool
is ashift=9, some benchmarking didn't show much difference with
ashift=12 other than giving up 8% of available space) during a copy
operation from 37.30 with sync=standard:

2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
Ver Client         NFSOPS  Reads SWrites AWrites Commits  Rd_bw SWr_bw  AWr_bw  Rd_t SWr_t AWr_t    Com_t Align%
3   xxx.xxx.37.30     108      0       0     108       0      0      0  111206     0     0   396  1917419    100
a bit later...
3   xxx.xxx.37.30     109      0       0     108       0      0      0  111411     0     0   427        0    100

sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:
logs   -  -  -  -  -  -
  c31t3d0s0 260M  9.68G  0  1.21K  0  85.3M
  c31t4d0s0 260M  9.68G  0  1.21K  0  85.1M

In case the alignment fails, the nonzero entries are under NFSOPS, AWrites,
AWr_bw, AWr_t, Com_t and Align%.  The Com_t (average commit time?) column
alternates between zero and a million or two (the other columns stay about
the same, and the zeros stay zero), while the Commits column stays zero
during the copy.  The write throughput to the logs varies quite a bit; that
sample is a very high mark, and it mainly alternates between almost zero and
30M each, which is kind of odd considering the copy speed (over the gigabit
network, the copy averages around 110MB/s).

When I 'zfs set sync=disabled', the output of nfssvrtop stays about the
same, except that Com_t stays 0, and the log devices also stay at 0 for
throughput.  Could you enlighten me as to what Com_t measures when Commits
stays zero?  Perhaps the nfs server caches asynchronous nfs writes as I
expect, but flushes its cache with synchronous writes?

Thanks,
Tim

On Thu, Jun 14, 2012 at 9:24 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote:

 On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
 client: ubuntu 11.10
 /etc/fstab entry: server:/mainpool/storage       /mnt/myelin     nfs
 bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async       0
    0

 nfsvers=3

 NAME              PROPERTY  VALUE     SOURCE
 mainpool/storage  sync      standard  default

 sync=standard

 This is expected behaviour for this combination. NFS 3 semantics are
 for persistent writes at the server regardless - and mostly also
 for NFS 4.

 NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4
 async and sync activity, try nfssvrtop
        https://github.com/richardelling/tools

  -- richard

 --

 ZFS and performance consulting
 http://www.RichardElling.com


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Richard Elling
Hi Tim,

On Jun 14, 2012, at 12:20 PM, Timothy Coalson wrote:

 Thanks for the script.  Here is some sample output from 'sudo
 ./nfssvrtop -b 512 5' (my disks are 512B-sector emulated and the pool
 is ashift=9, some benchmarking didn't show much difference with
 ashift=12 other than giving up 8% of available space) during a copy
 operation from 37.30 with sync=standard:
 
 2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB
 Ver Client         NFSOPS  Reads SWrites AWrites Commits  Rd_bw SWr_bw  AWr_bw  Rd_t SWr_t AWr_t    Com_t Align%
 3   xxx.xxx.37.30     108      0       0     108       0      0      0  111206     0     0   396  1917419    100
 a bit later...
 3   xxx.xxx.37.30     109      0       0     108       0      0      0  111411     0     0   427        0    100
 
 sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:
 logs   -  -  -  -  -  -
  c31t3d0s0 260M  9.68G  0  1.21K  0  85.3M
  c31t4d0s0 260M  9.68G  0  1.21K  0  85.1M
 
 In case the alignment fails, the nonzero entries are under NFSOPS,
 AWrites, AWr_bw, AWr_t, Com_t and Align%.  The Com_t (average commit
 time?) column alternates between zero and a million or two (the other
 columns stay about the same, the zeros stay zero), while the Commits
 column stays zero during the copy.  The write throughput to the logs
 varies quite a bit, that sample is a very high mark, it mainly
 alternates between almost zero and 30M each, which is kind of odd
 considering the copy speed (using gigabit network, copy speed averages
 around 110MB/s).

The client is using async writes, which include commits. Sync writes do not
need commits.

What happens is that the ZFS transaction group commit occurs at more-or-less
regular intervals, likely 5 seconds for more modern ZFS systems. When the
commit occurs, any data that is in the ARC but not committed in a prior
transaction group gets sent to the ZIL. This is why you might see a very
different amount of ZIL activity relative to the expected write workload.

 When I 'zfs set sync=disabled', the output of nfssvrtop stays about
 the same, except that Com_t stays 0, and the log devices also stay at 0
 for throughput.  Could you enlighten me as to what Com_t measures
 when Commits stays zero?  Perhaps the nfs server caches asynchronous
 nfs writes as I expect, but flushes its cache with synchronous
 writes?
 

With sync=disabled, the ZIL is not used, so the commit response to the client
is a lie, breaking the covenant between the server and client. In other words,
the server is supposed to respond to the commit only when the data is written
to permanent media, but the administrator overruled this action by disabling
the ZIL. If the server were to unexpectedly restart, or other conditions occur
such that the write cannot be completed, then the server and client will have
different views of the data, a form of data loss.

Different applications can react to long commit times differently. In this
example, we see 1.9 seconds for the commit versus about 400 microseconds for
each async write. The cause of the commit latency is not apparent from any
bandwidth measurements (e.g. zpool iostat), so you should consider looking
more closely at the iostat -x latency to see if the log devices are
performing well.
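
A minimal example of that kind of check (iostat is standard; the -n flag
prints cXtYdZ device names instead of sdNN, and the two names below are just
the slog slices from the zpool iostat output earlier in the thread):

  # watch per-device service times (asvc_t) on the log devices every 5 seconds
  iostat -xn 5 | egrep 'device|c31t3d0|c31t4d0'
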
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Robert Milkowski
 The client is using async writes, that include commits. Sync writes do
 not need commits.
 
 What happens is that the ZFS transaction group commit occurs at more-
 or-less regular intervals, likely 5 seconds for more modern ZFS
 systems. When the commit occurs, any data that is in the ARC but not
 committed in a prior transaction group gets sent to the ZIL

Are you sure? I don't think this is the case unless I misunderstood you or
this is some recent change to Illumos.

Whatever is being committed when the zfs txg closes goes directly to the pool
and not to the zil. Only sync writes will go to the zil right away (and not
always; see logbias, etc.) and to the arc, to be committed later to the pool
when the txg closes.


-- 
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Timothy Coalson
 The client is using async writes, that include commits. Sync writes do not
 need commits.

Are you saying nfs commit operations sent by the client aren't always
reported by that script?

 What happens is that the ZFS transaction group commit occurs at more-or-less
 regular intervals, likely 5 seconds for more modern ZFS systems. When the
 commit occurs, any data that is in the ARC but not committed in a prior
 transaction
 group gets sent to the ZIL. This is why you might see a very different amount 
 of
 ZIL activity relative to the expected write workload.

I thought this was the other way around: synchronous writes immediately get
copied to ZIL blocks (logbias may cause them to be allocated from the data
disks despite the existence of a separate log device) so that a txg does not
have to go through immediately to service them; then a txg that includes the
same data eventually goes directly to the data disks, and the entries are
removed from the ZIL.

 With sync=disabled, the ZIL is not used, thus the commit response to the 
 client
 is a lie, breaking the covenant between the server and client.

I was not clear: the behavior with sync=disabled was exactly what I expected
it to be; my question was why the Com_t attribute was nonzero with
sync=standard while Commits stayed zero.  To be clear, while it was in the
middle of copying data, the Commits column always read zero, never even one
commit reported (in fact, Commits was always zero, even when a copy
completed).  I would expect this to mean the nfs client was issuing
asynchronous writes without a single commit request (eventually filling the
server's cache and requiring the server to do something to resolve it).  I am
copying a single large file, 8GB, and usually interrupting it before it
finishes, after I have some data that matches the period where it is writing
to the log devices (it takes a few seconds to get through the client and
server caches and start using the disks).

 Different applications can react to long commit times differently. In this 
 example,
 we see 1.9 seconds for the commit versus about 400 microseconds for each
 async write. The cause of the latency of the commit is not apparent from any
 bandwidth measurements (eg zpool iostat) and you should consider looking
 more closely at the iostat -x latency to see if the log devices are 
 performing
 well.

I am still unclear on the commit time being reported for zero reported
commit requests.  Here is what I think are the log devices (I am
unsure how to translate the sd## mapping to solaris style
controller/target devices) under iostat -x 5 when they are being
written to heavily:

devicer/sw/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd30  0.0 1073.10.0 72940.6  0.0  7.67.1   1  77
sd31  0.0 1075.30.0 73090.2  0.0  7.67.1   1  77

Considering their maximum write speed is reported as 90MB/s, and they
each wrote 70MB/s in that interval, I do not expect the svc_t to
reflect the seek time they can achieve.  There were no particularly
suspicious spikes in svc_t on any device while the Com_t was large.  I
have a guess that your script is reporting the time taken to flush a
cache that got filled with async writes, and needed to be emptied to
be able to continue to service async writes (a lot of data, therefore
a lot of time).

I'm wondering if this belongs on nfs-discuss: if the nfs server issues
synchronous writes to empty its cache (without receiving an nfs commit
request), that would seem to explain everything.

Tim


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Phil Harman
On 14 Jun 2012, at 23:15, Timothy Coalson tsc...@mst.edu wrote:

 The client is using async writes, that include commits. Sync writes do not
 need commits.
 
 Are you saying nfs commit operations sent by the client aren't always
 reported by that script?

They are not reported in your case because the commit rate is less than one per 
second.

DTrace is an amazing tool, but it does dictate certain coding compromises, 
particularly when it comes to output scaling, grouping, sorting and formatting.

In this script the commit rate is calculated using integer division. In your 
case the sample interval is 5 seconds, so up to 4 commits per second will be 
reported as a big fat zero.

If you use a sample interval of 1 second you should see occasional commits. We 
know they are there because we see a non-zero commit time.
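
Concretely, that just means re-running the earlier invocation with the
interval argument dropped from 5 seconds to 1:

  sudo ./nfssvrtop -b 512 1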




Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-14 Thread Timothy Coalson
Indeed they are there, shown with a 1 second interval.  So, it is the
client's fault after all.  I'll have to see whether it is somehow possible to
get the server to write cached data sooner (and hopefully asynchronously),
and the client to issue commits less often.  Luckily, I can live with the
current behavior (and the SSDs shouldn't give out any time soon even being
used like this) if it isn't possible to change it.

Thanks for all the help,
Tim

On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman phil.har...@gmail.com wrote:
 On 14 Jun 2012, at 23:15, Timothy Coalson tsc...@mst.edu wrote:

 The client is using async writes, that include commits. Sync writes do not
 need commits.

 Are you saying nfs commit operations sent by the client aren't always
 reported by that script?

 They are not reported in your case because the commit rate is less than one 
 per second.

 DTrace is an amazing tool, but it does dictate certain coding compromises, 
 particularly when it comes to output scaling, grouping, sorting and 
 formatting.

 In this script the commit rate is calculated using integer division. In your 
 case the sample interval is 5 seconds, so up to 4 commits per second will be 
 reported as a big fat zero.

 If you use a sample interval of 1 second you should see occasional commits. 
 We know they are there because we see a non-zero commit time.




[zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Timothy Coalson
I noticed recently that the SSDs hosting the ZIL for my pool had a large
number in the SMART attribute for total LBAs written (with some calculation,
it seems to be the total amount of data written to the pool so far), so I did
some testing and found that the ZIL is being used quite heavily (matching the
write speed) on writes that should be asynchronous.  I did a capture with
wireshark during a simple copy of a large file (8GB), and both the write
packets and the write responses showed the "UNSTABLE" value I would expect
for asynchronous writes.  The copy didn't finish before the server started
writing heavily to the ZIL, so I wouldn't expect it to tell NFS to commit at
that point.  Here are some relevant pieces of info (with hostnames, etc.
removed):

client: ubuntu 11.10
/etc/fstab entry: server:/mainpool/storage   /mnt/myelin nfs
bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async   0
0

server: OpenIndiana oi_151a4
$ zfs get sync mainpool
NAME  PROPERTY  VALUE SOURCE
mainpool  sync  standard  default
$ zfs get sync mainpool/storage
NAME  PROPERTY  VALUE SOURCE
mainpool/storage  sync  standard  default
$ zfs get sharenfs mainpool/storage
NAME  PROPERTY  VALUE   SOURCE
mainpool/storage  sharenfs  rw=@xxx.xxx.37  local
$ zpool get version mainpool
NAME  PROPERTY  VALUESOURCE
mainpool  version   28   default

The pool consists of 24 SATA disks arranged as 2 raid-z2 groups of 12, and
originally had a mirrored log across 10GB slices on two Intel 320 80GB SSDs,
but I have since rearranged the logs as non-mirrored (since a single SSD
doesn't quite keep up with gigabit network throughput, and some testing
convinced me that a pool survives a failing log device gracefully, as long
as there isn't a simultaneous crash).  I would like to switch them back to
a mirrored configuration, but without impacting the asynchronous throughput,
and obviously it would be nice to reduce the write load to them so they live
longer.  As far as I can tell, all nfs writes are being written to the ZIL
when many should be cached in memory and bypass the ZIL.

Any help appreciated,
Tim


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Daniel Carosone
On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
 client: ubuntu 11.10
 /etc/fstab entry: server:/mainpool/storage   /mnt/myelin nfs
 bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async   0
 0

nfsvers=3

 NAME  PROPERTY  VALUE SOURCE
 mainpool/storage  sync  standard  default

sync=standard

This is expected behaviour for this combination. NFS 3 semantics are
for persistent writes at the server regardless - and mostly also 
for NFS 4.

The async client mount option relates to when the writes get shipped
to the server (immediately or delayed in dirty pages), rather than to
how the server should handle those writes once they arrive.

You could set sync=disabled if you're happy with the consequences, or
even just as a temporary test to confirm the impact.  It sounds like
you would be since that's what you're trying to achieve.

There is a difference: async on the client means data is lost on a
client reboot, async on the server means data may be lost on a server
reboot (and the client/application confused by inconsistencies as a
result). 

Separate datasets (and mounts) for data with different persistence
requirements can help.
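
If you do try it, a sketch of the temporary test (dataset name taken from
earlier in the thread) and the way back:

  zfs set sync=disabled mainpool/storage  # test only: commits become no-ops
  zfs inherit sync mainpool/storage       # revert to the default (standard)

  # or confine the relaxed setting to a scratch dataset (name is hypothetical)
  zfs create -o sync=disabled mainpool/scratch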

--
Dan.






Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Timothy Coalson
Interesting... what I had read about NFSv3 asynchronous writes, especially
bits about "does not require the server to commit to stable storage", led me
to expect different behavior.  The performance impact on large writes (which
we do a lot of) wasn't severe, so sync=disabled is probably not worth the
risk.  The SSDs should also be able to take a fair amount of abuse, so I can
live with the behavior as-is.

Here's one example of documentation that led me to expect something else:
http://nfs.sourceforge.net/ (search for "unstable").  It did indeed seem like
the nfs client was delaying writes, then doing synchronous nfs calls, which
is why I looked into the packets with wireshark (and found them to be
advertised as UNSTABLE, but with the pool acting as though the writes were
synchronous).

Thanks,
Tim

On Wed, Jun 13, 2012 at 6:51 PM, Daniel Carosone d...@geek.com.au wrote:

 On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
  client: ubuntu 11.10
  /etc/fstab entry: server:/mainpool/storage       /mnt/myelin     nfs
  bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async
  0
      0

 nfsvers=3

  NAME              PROPERTY  VALUE     SOURCE
  mainpool/storage  sync      standard  default

 sync=standard

 This is expected behaviour for this combination. NFS 3 semantics are
 for persistent writes at the server regardless - and mostly also
 for NFS 4.

 The async client mount option relates to when the writes get shipped
 to the server (immediately or delayed in dirty pages), rather than to
 how the server should handle those writes once they arrive.

 You could set sync=disabled if you're happy with the consequences, or
 even just as a temporary test to confirm the impact.  It sounds like
 you would be since that's what you're trying to achieve.

 There is a difference: async on the client means data is lost on a
 client reboot, async on the server means data may be lost on a server
 reboot (and the client/application confused by inconsistencies as a
 result).

 Separate datasets (and mounts) for data with different persistence
 requirements can help.

 --
 Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss