Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Casper . Dik

On Fri, 25 Sep 2009, James Lever wrote:

 NFS Version 3 introduces the concept of “safe asynchronous writes.”

Being safe then requires a responsibility level on the client which 
is often not present.  For example, if the server crashes, and then 
the client crashes, how does the client resend the uncommitted data? 
If the client had a non-volatile storage cache, then it would be able 
to responsibly finish the writes that failed.

If the server crashes, it is clear that work will be lost up to the point
that the client did a successful commit; recovering the rest requires
support for the NFSv3 commit operation and resending the missing operations.
If the client crashes, we know that non-committed operations may be dropped
on the floor.

The commentary says that normally the COMMIT operations occur during 
close(2) or fsync(2) system call, or when encountering memory 
pressure.  If the problem is slow copying of many small files, this 
COMMIT approach does not help very much since very little data is sent 
per file and most time is spent creating directories and files.

Indeed; the commit is mostly to make sure that the pipe between the server
and the client can be filled for write operations.

Casper



Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Thu, Sep 24, 2009 at 11:29 PM, James Lever j...@jamver.id.au wrote:

 On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:

 The commentary says that normally the COMMIT operations occur during
 close(2) or fsync(2) system call, or when encountering memory pressure.  If
 the problem is slow copying of many small files, this COMMIT approach does
 not help very much since very little data is sent per file and most time is
 spent creating directories and files.

 The problem appears to be slog bandwidth exhaustion due to all data being
 sent via the slog, creating contention for all subsequent NFS or locally
 synchronous writes.  The NFS writes do not appear to be synchronous in
 nature - there is only a COMMIT being issued at the very end, however, all
 of that data appears to be going via the slog and it appears to be inflating
 to twice its original size.
 For a test, I just copied a relatively small file (8.4MB in size).  Looking
 at a tcpdump analysis using wireshark, there is a SETATTR which ends with a
 V3 COMMIT and no COMMIT messages during the transfer.
 iostat output that matches looks like this:
 slog write of the data (17MB appears to hit the slog)
[snip]
 then a few seconds later, the transaction group gets flushed to primary
 storage writing nearly 11.4MB, which is in line with RAIDZ2 (expect around
 10.5MB; 8.4/8*10):
[snip]
 So I performed the same test with a much larger file (533MB) to see what it
 would do, being larger than the NVRAM cache in front of the SSD.  Note that
 after the second second of activity the NVRAM is full and only allowing in
 about the sequential write speed of the SSD (~70MB/s).
[snip]
 Again, the slog wrote about double the file size (1022.6MB) and a few
 seconds later, the data was pushed to the primary storage (684.9MB with an
 expectation of 666MB = 533MB/8*10) so again about the right number hit the
 spinning platters.
[snip]
 Can anybody explain what is going on with the slog device in that all data
 is being shunted via it and why about double the data size is being written
 to it per transaction?

By any chance do you have copies=2 set?

That will make two transactions out of one.

Also, try setting zfs_write_limit_override equal to the size of the
NVRAM cache (or half depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw

Set the PERC flush interval to say 1 second.
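
For reference, the 0t prefix in mdb means decimal. As a quick sketch (in
Python; the 256MB cache size is an assumption), the value works out as:

  # Hypothetical helper: build the mdb command for a given NVRAM cache size.
  nvram_mb = 256                    # assumed 256MB PERC cache
  limit = nvram_mb * 1024 * 1024    # bytes, as a plain decimal for 0t notation
  print(f"echo zfs_write_limit_override/W0t{limit} | mdb -kw")
  # prints: echo zfs_write_limit_override/W0t268435456 | mdb -kw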

As an aside, an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much
higher throughput than an SSD. An example of a workload that benefits
from an slog device is ESX over NFS, which does a COMMIT for each
block written, so it benefits from an slog, but a standard media
server will not (but an L2ARC would be beneficial).

Better workload analysis is really what it is about.

-Ross


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Ross Walker wrote:


As an aside, an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much


Surely this depends on the origin of the large sequential writes.  If 
the origin is NFS and the SSD has considerably more sustained write 
bandwidth than the ethernet transfer bandwidth, then using the SSD is 
a win.  If the SSD accepts data slower than the ethernet can deliver 
it (which seems to be this particular case) then the SSD is not 
helping.


If the ethernet can pass 100MB/second, then the sustained write 
specification for the SSD needs to be at least 100MB/second.  Since 
data is buffered in the Ethernet/TCP/IP/NFS stack prior to sending it 
to ZFS, the SSD should support write bursts of at least double that or 
else it will not be helping bulk-write performance.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Richard Elling

On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:


On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:

On Fri, 25 Sep 2009, Ross Walker wrote:


As an aside, an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much


Surely this depends on the origin of the large sequential writes.  If the
origin is NFS and the SSD has considerably more sustained write bandwidth
than the ethernet transfer bandwidth, then using the SSD is a win.  If the
SSD accepts data slower than the ethernet can deliver it (which seems to be
this particular case) then the SSD is not helping.

If the ethernet can pass 100MB/second, then the sustained write
specification for the SSD needs to be at least 100MB/second.  Since data is
buffered in the Ethernet/TCP/IP/NFS stack prior to sending it to ZFS, the
SSD should support write bursts of at least double that or else it will not
be helping bulk-write performance.


Specifically I was talking NFS as that was what the OP was talking
about, but yes it does depend on the origin, but you also assume that
NFS IO goes over only a single 1Gbe interface when it could be over
multiple 1Gbe interfaces or a 10Gbe interface or even multiple 10Gbe
interfaces. You also assume the IO recorded in the ZIL is just the raw
IO when there is also meta-data or multiple transaction copies as
well.

Personally I still prefer to spread the ZIL across the pool and have
a large NVRAM backed HBA as opposed to an slog which really puts all
my IO in one basket. If I had a pure NVRAM device I might consider
using that as an slog device, but SSDs are too variable for my taste.


Back of the envelope math says:
10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
1 GByte/sec * 30 sec = 30 GBytes

Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
or so.
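
As a sanity check, a few lines of Python restating the same envelope math
(the 1 GByte/s pipe and 70 MByte/s SSD figures are the assumptions above):

  pipe_mb_s = 1000    # ~10 GbE in MByte/s
  ssd_mb_s = 70       # sustained write speed per slog SSD
  txg_sec = 30        # default txg commit interval
  print(pipe_mb_s // ssd_mb_s + 1)    # 15 slog SSDs to match the pipe
  print(pipe_mb_s * txg_sec / 1024)   # ~29.3 GBytes to buffer 30 seconds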

Both of the above assume there is lots of memory in the server.
This is increasingly becoming easier to do as the memory costs
come down and you can physically fit 512 GBytes in a 4u server.
By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
NVRAM in their arrays. So Bob's recommendation of reducing the
txg commit interval below 30 seconds also has merit.  Or, to put it
another way, the dynamic sizing of the txg commit interval isn't
quite perfect yet. [Cue for Neil to chime in... :-)]
 -- richard




Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread James Lever


On 26/09/2009, at 1:14 AM, Ross Walker wrote:


By any chance do you have copies=2 set?


No, only 1.  So the double data going to the slog (as reported by  
iostat) is still confusing me and clearly potentially causing  
significant harm to my performance.



Also, try setting zfs_write_limit_override equal to the size of the
NVRAM cache (or half depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw


That’s an interesting concept.  All data still appears to go via the  
slog device, however, under heavy load my response time to a new write is  
typically below 2s (a few outliers at about 3.5s) and a read  
(directory listing of a non-cached entry) is about 2s.


What will this do once it hits the limit?  Will streaming writes now  
be sent directly to a txg and streamed to the primary storage  
devices?  (that is what I would like to see happen).



As an aside, an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much
higher throughput than an SSD. An example of a workload that benefits
from an slog device is ESX over NFS, which does a COMMIT for each
block written, so it benefits from an slog, but a standard media
server will not (but an L2ARC would be beneficial).

Better workload analysis is really what it is about.



It seems that it doesn’t matter what the workload is if the NFS pipe  
can sustain more continuous throughput than the slog chain can support.


I suppose some creative use of the logbias setting might assist this  
situation and force all potentially heavy writers directly to the  
primary storage.  This would, however, negate any benefit for having a  
fast, low latency device for those filesystems for the times when it  
is desirable (any large batch of small writes, for example).


Is there a way to have a dynamic, auto logbias type setting depending  
on the transaction currently presented to the server such that if it  
is clearly a large streaming write it gets treated as  
logbias=throughput and if it is a small transaction it gets treated as  
logbias=latency?  (i.e. such that NFS transactions can be effectively  
treated as if it was local storage but only slightly breaking the benefits  
of the txg scheduling).


On 26/09/2009, at 3:39 AM, Richard Elling wrote:


Back of the envelope math says:
10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
1 GByte/sec * 30 sec = 30 GBytes

Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
or so.


At this point, enter the fusionIO cards or similar devices.   
Unfortunately there does not seem to be anything on the market with  
infinitely fast write capacity (memory speeds) that is also supported  
under OpenSolaris as a slog device.


I think this is precisely what I (and anybody running a general  
purpose NFS server) need for a general purpose slog device.



Both of the above assume there is lots of memory in the server.
This is increasingly becoming easier to do as the memory costs
come down and you can physically fit 512 GBytes in a 4u server.
By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
NVRAM in their arrays. So Bob's recommendation of reducing the
txg commit interval below 30 seconds also has merit.  Or, to put it
another way, the dynamic sizing of the txg commit interval isn't
quite perfect yet. [Cue for Neil to chime in... :-)]


How does reducing the txg commit interval really help?  Will data no  
longer go via the slog once it is streaming to disk?  or will data  
still all be pushed through the slog regardless?


For a predominantly NFS server purpose, it really looks like a case of  
the slog having to outperform your main pool for continuous write speed,  
as well as offering instant response time, as the primary criteria. Which  
might as well be a fast (or group of fast) SSDs or 15kRPM drives with  
some NVRAM in front of them.


Is there also a way to throttle synchronous writes to the slog  
device?  Much like the ZFS write throttling that is already  
implemented, so that there is a gap for new writers to enter when  
writing to the slog device? (or is this the norm and includes slog  
writes?)


cheers,
James



Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Fri, Sep 25, 2009 at 5:24 PM, James Lever j...@jamver.id.au wrote:

 On 26/09/2009, at 1:14 AM, Ross Walker wrote:

 By any chance do you have copies=2 set?

 No, only 1.  So the double data going to the slog (as reported by iostat) is
 still confusing me and clearly potentially causing significant harm to my
 performance.

Weird then, I thought that would be an easy explanation.

 Also, try setting zfs_write_limit_override equal to the size of the
 NVRAM cache (or half depending on how long it takes to flush):

 echo zfs_write_limit_override/W0t268435456 | mdb -kw

 That’s an interesting concept.  All data still appears to go via the slog
 device, however, under heavy load my response time to a new write is typically
 below 2s (a few outliers at about 3.5s) and a read (directory listing of a
 non-cached entry) is about 2s.

 What will this do once it hits the limit?  Will streaming writes now be sent
 directly to a txg and streamed to the primary storage devices?  (that is
 what I would like to see happen).

It sets the max size of a txg to the given size. When it hits that
number it flushes to disk.

 As an aside, an slog device will not be too beneficial for large
 sequential writes, because it will be throughput bound not latency
 bound. slog devices really help when you have lots of small sync
 writes. A RAIDZ2 with the ZIL spread across it will provide much
 higher throughput than an SSD. An example of a workload that benefits
 from an slog device is ESX over NFS, which does a COMMIT for each
 block written, so it benefits from an slog, but a standard media
 server will not (but an L2ARC would be beneficial).

 Better workload analysis is really what it is about.


 It seems that it doesn’t matter what the workload is if the NFS pipe can
 sustain more continuous throughput than the slog chain can support.

Only on large sequentials, small sync IO should benefit from the slog.

 I suppose some creative use of the logbias setting might assist this
 situation and force all potentially heavy writers directly to the primary
 storage.  This would, however, negate any benefit for having a fast, low
 latency device for those filesystems for the times when it is desirable (any
 large batch of small writes, for example).

 Is there a way to have a dynamic, auto logbias type setting depending on the
 transaction currently presented to the server such that if it is clearly a
 large streaming write it gets treated as logbias=throughput and if it is a
 small transaction it gets treated as logbias=latency?  (i.e. such that NFS
 transactions can be effectively treated as if it was local storage but
 only slightly breaking the benefits of the txg scheduling).

I'll leave that to the Sun guys to answer.

-Ross


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Fri, Sep 25, 2009 at 1:39 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:

 On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us wrote:

 On Fri, 25 Sep 2009, Ross Walker wrote:

 As an aside, an slog device will not be too beneficial for large
 sequential writes, because it will be throughput bound not latency
 bound. slog devices really help when you have lots of small sync
 writes. A RAIDZ2 with the ZIL spread across it will provide much

 Surely this depends on the origin of the large sequential writes.  If the
 origin is NFS and the SSD has considerably more sustained write bandwidth
 than the ethernet transfer bandwidth, then using the SSD is a win.  If
 the SSD accepts data slower than the ethernet can deliver it (which seems to
 be this particular case) then the SSD is not helping.

 If the ethernet can pass 100MB/second, then the sustained write
 specification for the SSD needs to be at least 100MB/second.  Since data
 is buffered in the Ethernet/TCP/IP/NFS stack prior to sending it to ZFS, the
 SSD should support write bursts of at least double that or else it will
 not be helping bulk-write performance.

 Specifically I was talking NFS as that was what the OP was talking
 about, but yes it does depend on the origin, but you also assume that
 NFS IO goes over only a single 1Gbe interface when it could be over
 multiple 1Gbe interfaces or a 10Gbe interface or even multiple 10Gbe
 interfaces. You also assume the IO recorded in the ZIL is just the raw
 IO when there is also meta-data or multiple transaction copies as
 well.

 Personally I still prefer to spread the ZIL across the pool and have
 a large NVRAM backed HBA as opposed to an slog which really puts all
 my IO in one basket. If I had a pure NVRAM device I might consider
 using that as an slog device, but SSDs are too variable for my taste.

 Back of the envelope math says:
        10 Gbe = ~1 GByte/sec of I/O capacity

 If the SSD can only sink 70 MByte/s, then you will need:
        int(1000/70) + 1 = 15 SSDs for the slog

 For capacity, you need:
        1 GByte/sec * 30 sec = 30 GBytes

Where did the 30 seconds come in here?

The amount of time to hold cache depends on how fast you can fill it.

 Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
 or so.

I'm thinking you can do less if you don't need to hold it for 30 seconds.

 Both of the above assume there is lots of memory in the server.
 This is increasingly becoming easier to do as the memory costs
 come down and you can physically fit 512 GBytes in a 4u server.
 By default, the txg commit will occur when 1/8 of memory is used
 for writes. For 30 GBytes, that would mean a main memory of only
 240 Gbytes... feasible for modern servers.

 However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
 NVRAM in their arrays. So Bob's recommendation of reducing the
 txg commit interval below 30 seconds also has merit.  Or, to put it
 another way, the dynamic sizing of the txg commit interval isn't
 quite perfect yet. [Cue for Neil to chime in... :-)]

I'm sorry, did I miss something Bob said about the txg commit interval?

I looked back and didn't see it, maybe it was off-list?

-Ross


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Marion Hakanson
j...@jamver.id.au said:
 For a predominantly NFS server purpose, it really looks like a case of the
 slog having to outperform your main pool for continuous write speed, as well
 as offering instant response time, as the primary criteria. Which might as
 well be a fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in
 front of them.

I wonder if you ran Richard Elling's zilstat while running your
workload.  That should tell you how much ZIL bandwidth is needed,
and it would be interesting to see if its stats match with your
other measurements of slog-device traffic.

I did some filebench and tar extract over NFS tests of J4400 (500GB,
7200RPM SATA drives), with and without slog, where slog was using the
internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind
the standard Sun/Adaptec internal RAID controller, 256MB battery-backed
cache memory, all on Solaris-10U7.

We saw slight differences on filebench oltp profile, and a huge speedup
for the tar extract over NFS tests with the slog present.  Granted, the
latter was with only one NFS client, so likely did not fill NVRAM.  Pretty
good results for a poor-person's slog, though:
http://acc.ohsu.edu/~hakansom/j4400_bench.html

Just as an aside, and based on my experience as a user/admin of various
NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem
to get very good improvements with relatively small amounts of NVRAM
(128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had
tens of GB of NVRAM.

Regards,

Marion




Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Fri, Sep 25, 2009 at 5:47 PM, Marion Hakanson hakan...@ohsu.edu wrote:
 j...@jamver.id.au said:
 For a predominantly NFS server purpose, it really looks like a case of the
 slog having to outperform your main pool for continuous write speed, as well
 as offering instant response time, as the primary criteria. Which might as
 well be a fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in
 front of them.

 I wonder if you ran Richard Elling's zilstat while running your
 workload.  That should tell you how much ZIL bandwidth is needed,
 and it would be interesting to see if its stats match with your
 other measurements of slog-device traffic.

Yes, but if it's on NFS you can just figure out the workload in MB/s
and use that as a rough guideline.

Problem is most SSD manufacturers list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.

 I did some filebench and tar extract over NFS tests of J4400 (500GB,
 7200RPM SATA drives), with and without slog, where slog was using the
 internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind
 the standard Sun/Adaptec internal RAID controller, 256MB battery-backed
 cache memory, all on Solaris-10U7.

 We saw slight differences on filebench oltp profile, and a huge speedup
 for the tar extract over NFS tests with the slog present.  Granted, the
 latter was with only one NFS client, so likely did not fill NVRAM.  Pretty
 good results for a poor-person's slog, though:
        http://acc.ohsu.edu/~hakansom/j4400_bench.html

I did a similar test with a 512MB BBU controller and saw no difference
with or without the SSD slog, so I didn't end up using it.

Does your BBU controller ignore the ZFS flushes?

 Just as an aside, and based on my experience as a user/admin of various
 NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem
 to get very good improvements with relatively small amounts of NVRAM
 (128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had
 tens of GB of NVRAM.

They don't hold on to the cache for a long time, just as long as it
takes to write it all to disk.

-Ross


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Richard Elling wrote:

By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.


Ahem.  We were advised that 7/8s of memory is currently what is 
allowed for writes.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufacturers list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems 
wrong to me.  Previously we were advised that the slog is basically a 
log of uncommitted system calls so the size of the data chunks written 
to the slog should be similar to the data sizes in the system calls.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Marion Hakanson
rswwal...@gmail.com said:
 Yes, but if it's on NFS you can just figure out the workload in MB/s and use
 that as a rough guideline. 

I wonder if that's the case.  We have an NFS server without NVRAM cache
(X4500), and it gets huge MB/sec throughput on large-file writes over NFS.
But it's painfully slow on the "tar extract lots of small files" test,
where many, tiny, synchronous metadata operations are performed.


 I did a smiliar test with a 512MB BBU controller and saw no difference with
 or without the SSD slog, so I didn't end up using it.
 
 Does your BBU controller ignore the ZFS flushes? 

I believe it does (it would be slow otherwise).  It's the Sun StorageTek
internal SAS RAID HBA.

Regards,

Marion




Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker


On Sep 25, 2009, at 6:19 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufacturers list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems  
wrong to me.  Previously we were advised that the slog is basically  
a log of uncommitted system calls so the size of the data chunks  
written to the slog should be similar to the data sizes in the  
system calls.


Are these not broken into recordsize chunks?

-Ross



Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Neil Perrin



On 09/25/09 16:19, Bob Friesenhahn wrote:

On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufactures list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems 
wrong to me.  Previously we were advised that the slog is basically a 
log of uncommitted system calls so the size of the data chunks written 
to the slog should be similar to the data sizes in the system calls.


Log blocks are variable in size dependent on what needs to be committed.
The minimum size is 4KB and the max 128KB. Log records are aggregated
and written together as much as possible.
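
A toy model in Python (an illustration of the behaviour described above,
not the actual ZIL code) of how records coalesce into 4KB-128KB blocks:

  MIN_BLK, MAX_BLK = 4 * 1024, 128 * 1024   # log block size bounds

  def pack_records(record_sizes):
      """Aggregate log records into variable-sized log blocks."""
      blocks, cur = [], 0
      for r in record_sizes:
          if cur and cur + r > MAX_BLK:
              blocks.append(max(cur, MIN_BLK))  # a block is at least 4KB
              cur = 0
          cur += r
      if cur:
          blocks.append(max(cur, MIN_BLK))
      return blocks

  # 100 x 2KB sync writes coalesce into two log blocks, not 100 tiny ones:
  print(pack_records([2048] * 100))   # [131072, 73728]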

Neil.


Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread Bob Friesenhahn

On Thu, 24 Sep 2009, James Lever wrote:


I was of the (mis)understanding that only metadata and writes smaller than 
64k went via the slog device in the event of an O_SYNC write request?


What would cause you to understand that?

Is there a way to tune this on the NFS server or clients such that when I 
perform a large synchronous write, the data does not go via the slog device?


Synchronous writes are needed by NFS to support its atomic write 
requirement.  It sounds like your SSD is write-bandwidth bottlenecked 
rather than IOPS bottlenecked.  Replacing your SSD with a more 
performant one seems like the first step.


NFS client tunings can make a big difference when it comes to 
performance.  Check the nfs(5) manual page for your Linux systems to 
see what options are available.  An obvious tunable is 'wsize' which 
should ideally match (or be a multiple of) the zfs filesystem block 
size.  The /proc/mounts file for my Debian install shows that 1048576 
is being used.  This is quite large and perhaps a smaller value would 
help.  If you are willing to accept the risk, using the Linux 'async' 
mount option may make things seem better.
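
As a small sketch for checking this on a Linux client (Python, assuming
the standard /proc/mounts format), report the negotiated rsize/wsize:

  import re

  with open("/proc/mounts") as f:
      for line in f:
          device, mountpoint, fstype, options = line.split()[:4]
          if fstype.startswith("nfs"):
              sizes = dict(re.findall(r"(rsize|wsize)=(\d+)", options))
              print(mountpoint, sizes)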


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread Richard Elling

comment below...

On Sep 23, 2009, at 10:00 PM, James Lever wrote:


On 08/09/2009, at 2:01 AM, Ross Walker wrote:

On Sep 7, 2009, at 1:32 AM, James Lever j...@jamver.id.au wrote:

Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive  
RAIDZ2s with a hotspare... That should provide 320 IOPS instead of  
160, big difference.


The issue is interactive responsiveness and if there is a way to  
tune the system to give that while still having good performance  
for builds when they are run.


Look at the write IOPS of the pool with the zpool iostat -v and  
look at how many are happening on the RAIDZ2 vdev.


I was suggesting that slog writes were possibly starving reads from  
the l2arc as they were on the same device.  This appears not to  
have been the issue as the problem has persisted even with the  
l2arc devices removed from the pool.


The SSD will handle a lot more IOPS then the pool and L2ARC is a  
lazy reader, it mostly just holds on to read cache data.


It just may be that the pool configuration just can't handle the  
write IOPS needed and reads are starving.


Possible, but hard to tell.  Have a look at the iostat results  
I’ve posted.


The busy times of the disks while the issue is occurring should let  
you know.


So it turns out that the problem is that all writes coming via NFS  
are going through the slog.  When that happens, the transfer speed  
to the device drops to ~70MB/s (the write speed of this SLC SSD) and  
until the load drops all new write requests are blocked, causing a  
noticeable delay (which has been observed to be up to 20s, but  
generally only 2-4s).


Thank you sir, can I have another?
If you add (not attach) more slogs, the workload will be spread across  
them.  But...




I can reproduce this behaviour by copying a large file (hundreds of  
MB in size) using 'cp src dst’ on an NFS (still currently v3) client  
and observe that all data is pushed through the slog device (10GB  
partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC)  
rather than going direct to the primary storage disks.


On a related note, I had 2 of these devices (both using just 10GB  
partitions) connected as log devices (so the pool had 2 separate log  
devices) and the second one was consistently running significantly  
slower than the first.  Removing the second device made an  
improvement on performance, but did not remove the occasional  
observed pauses.


...this is not surprising, when you add a slow slog device.  This is  
the weakest link rule.


I was of the (mis)understanding that only metadata and writes  
smaller than 64k went via the slog device in the event of an O_SYNC  
write request?


The threshold is 32 kBytes, which is unfortunately the same as the default
NFS write size. See CR6686887
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887

If you have a slog and logbias=latency (default) then the writes go to  
the slog.
So there is some interaction here that can affect NFS workloads in  
particular.
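
A rough sketch in Python (a simplified model of the rule just described,
not the actual ZFS code paths; names are made up for illustration):

  IMMEDIATE_WRITE_SZ = 32 * 1024   # the 32 kByte threshold noted above

  def sync_write_path(size, have_slog, logbias="latency"):
      if have_slog and logbias == "latency":
          return "slog"                       # the data itself hits the slog
      if size <= IMMEDIATE_WRITE_SZ:
          return "ZIL record carries the data"
      return "ZIL record points at data blocks in the pool"

  # A 32KB NFS write with a latency-biased slog lands on the slog:
  print(sync_write_path(32 * 1024, have_slog=True))   # slog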




The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via  
the slog device?


You can change the IOP size on the client.
 -- richard



I have investigated using the logbias setting, but that will just  
kill small file performance also on any filesystem using it and  
defeat the purpose of having a slog device at all.


cheers,
James



Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever


On 25/09/2009, at 2:58 AM, Richard Elling wrote:


On Sep 23, 2009, at 10:00 PM, James Lever wrote:

So it turns out that the problem is that all writes coming via NFS  
are going through the slog.  When that happens, the transfer speed  
to the device drops to ~70MB/s (the write speed of his SLC SSD) and  
until the load drops all new write requests are blocked causing a  
noticeable delay (which has been observed to be up to 20s, but  
generally only 2-4s).


Thank you sir, can I have another?
If you add (not attach) more slogs, the workload will be spread  
across them.  But...


My log configuration is:

logs
  c7t2d0s0   ONLINE   0 0 0
  c7t3d0s0   OFFLINE  0 0 0

I’m going to test the now removed SSD and see if I can get it to  
perform significantly worse than the first one, but my recollection from  
pre-production testing was that they were both  
equally slow but not significantly different.


On a related note, I had 2 of these devices (both using just 10GB  
partitions) connected as log devices (so the pool had 2 separate  
log devices) and the second one was consistently running  
significantly slower than the first.  Removing the second device  
made an improvement on performance, but did not remove the  
occasional observed pauses.


...this is not surprising, when you add a slow slog device.  This is  
the weakest link rule.


So, in theory, even if one of the two SSDs was even slightly slower  
than the other, it would just appear that it would be more heavily  
affected?


Here is part of what I’m not understanding - unless one SSD is  
significantly worse than the other, how can the following scenario be  
true?  Here is some iostat output from the two slog devices at 1s  
intervals when it gets a large series of write requests.


Idle at start.

0.0 1462.0    0.0 187010.2    0.0 28.6    0.0   19.6   2  83   0   0   0   0 c7t2d0
0.0  233.0    0.0  29823.7    0.0 28.7    0.0  123.3   0  83   0   0   0   0 c7t3d0


NVRAM cache close to full. (256MB BBC)

0.0   84.0    0.0  10622.0    0.0  3.5    0.0   41.2   0  12   0   0   0   0 c7t2d0
0.0    0.0    0.0      0.0    0.0 35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0

0.0    0.0    0.0      0.0    0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
0.0  305.0    0.0  39039.3    0.0 35.0    0.0  114.7   0 100   0   0   0   0 c7t3d0

0.0    0.0    0.0      0.0    0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
0.0  361.0    0.0  46208.1    0.0 35.0    0.0   96.8   0 100   0   0   0   0 c7t3d0

0.0    0.0    0.0      0.0    0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
0.0  329.0    0.0  42114.0    0.0 35.0    0.0  106.3   0 100   0   0   0   0 c7t3d0

0.0    0.0    0.0      0.0    0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
0.0  317.0    0.0  40449.6    0.0 27.4    0.0   86.5   0  85   0   0   0   0 c7t3d0

0.0    4.0    0.0    263.8    0.0  0.0    0.0    0.2   0   0   0   0   0   0 c7t2d0
0.0    4.0    0.0    367.8    0.0  0.0    0.0    0.3   0   0   0   0   0   0 c7t3d0


What determines the size of the writes or distribution between slog  
devices?  It looks like ZFS decided to send a large chunk to one slog  
which nearly filled the NVRAM, and then continue writing to the other  
one, which meant that it had to go at device speed (whatever that is  
for the data size/write size).   Is there a way to tune the writes to  
multiple slogs to be (for argument's sake) 10MB slices?


I was of the (mis)understanding that only metadata and writes  
smaller than 64k went via the slog device in the event of an O_SYNC  
write request?


The threshold is 32 kBytes, which is unfortunately the same as the default
NFS write size. See CR6686887
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887

If you have a slog and logbias=latency (default) then the writes go  
to the slog.
So there is some interaction here that can affect NFS workloads in  
particular.


Interesting CR.

nfsstat -m output on one of the linux hosts (ubuntu)

 Flags: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.0.17,mountvers=3,mountproto=tcp,addr=10.1.0.17


rsize and wsize auto tuned to 1MB.  How does this affect the sync  
request threshold?



The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via  
the slog device?


You can change the IOP size on the client.



You’re suggesting modifying rsize/wsize?  or something else?

cheers,
James



Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever


On 25/09/2009, at 1:24 AM, Bob Friesenhahn wrote:


On Thu, 24 Sep 2009, James Lever wrote:


Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via  
the slog device?


Synchronous writes are needed by NFS to support its atomic write  
requirement.  It sounds like your SSD is write-bandwidth  
bottlenecked rather than IOPS bottlenecked.  Replacing your SSD with  
a more performant one seems like the first step.


NFS client tunings can make a big difference when it comes to  
performance.  Check the nfs(5) manual page for your Linux systems to  
see what options are available.  An obvious tunable is 'wsize' which  
should ideally match (or be a multiple of) the zfs filesystem block  
size.  The /proc/mounts file for my Debian install shows that  
1048576 is being used.  This is quite large and perhaps a smaller  
value would help.  If you are willing to accept the risk, using the  
Linux 'async' mount option may make things seem better.


From the Linux NFS FAQ.  http://nfs.sourceforge.net/

NFS Version 3 introduces the concept of “safe asynchronous writes.”

And it continues.

My rsize and wsize are negotiating to 1MB.

James



Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, James Lever wrote:


NFS Version 3 introduces the concept of “safe asynchronous writes.”


Being safe then requires a responsibility level on the client which 
is often not present.  For example, if the server crashes, and then 
the client crashes, how does the client resend the uncommitted data? 
If the client had a non-volatile storage cache, then it would be able 
to responsibly finish the writes that failed.


The commentary says that normally the COMMIT operations occur during 
close(2) or fsync(2) system call, or when encountering memory 
pressure.  If the problem is slow copying of many small files, this 
COMMIT approach does not help very much since very little data is sent 
per file and most time is spent creating directories and files.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever


On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:

The commentary says that normally the COMMIT operations occur during  
close(2) or fsync(2) system call, or when encountering memory  
pressure.  If the problem is slow copying of many small files, this  
COMMIT approach does not help very much since very little data is  
sent per file and most time is spent creating directories and files.


The problem appears to be slog bandwidth exhaustion due to all data  
being sent via the slog, creating contention for all subsequent NFS or  
locally synchronous writes.  The NFS writes do not appear to be  
synchronous in nature - there is only a COMMIT being issued at the  
very end, however, all of that data appears to be going via the slog  
and it appears to be inflating to twice its original size.


For a test, I just copied a relatively small file (8.4MB in size).   
Looking at a tcpdump analysis using wireshark, there is a SETATTR  
which ends with a V3 COMMIT and no COMMIT messages during the transfer.


iostat output that matches looks like this:

slog write of the data (17MB appears to hit the slog)

Friday, 25 September 2009  1:01:00 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0  135.0    0.0  17154.5    0.0  0.8    0.0    6.0   0   3   0   0   0   0 c7t2d0


then a few seconds later, the transaction group gets flushed to  
primary storage writing nearly 11.4MB, which is in line with RAIDZ2  
(expect around 10.5MB; 8.4/8*10):


Friday, 25 September 2009  1:01:13 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0   91.0    0.0   1170.4    0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t0d0
0.0   84.0    0.0   1171.4    0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t1d0
0.0   92.0    0.0   1172.4    0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t2d0
0.0   84.0    0.0   1172.4    0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t3d0
0.0   81.0    0.0   1176.4    0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t4d0
0.0   86.0    0.0   1176.4    0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t5d0
0.0   89.0    0.0   1175.4    0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t6d0
0.0   84.0    0.0   1175.4    0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t7d0
0.0   91.0    0.0   1168.9    0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t8d0
0.0   89.0    0.0   1170.9    0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t9d0


So I performed the same test with a much larger file (533MB) to see  
what it would do, being larger than the NVRAM cache in front of the  
SSD.  Note that after the second second of activity the NVRAM is full  
and only allowing in about the sequential write speed of the SSD  
(~70MB/s).


Friday, 25 September 2009  1:13:14 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0  640.9    0.0  81782.9    0.0  4.2    0.0    6.5   1  14   0   0   0   0 c7t2d0
0.0 1065.7    0.0 136408.1    0.0 18.6    0.0   17.5   1  78   0   0   0   0 c7t2d0
0.0  579.0    0.0  74113.3    0.0 30.7    0.0   53.1   1 100   0   0   0   0 c7t2d0
0.0  588.7    0.0  75357.0    0.0 33.2    0.0   56.3   1 100   0   0   0   0 c7t2d0
0.0  532.0    0.0  68096.3    0.0 31.5    0.0   59.1   1 100   0   0   0   0 c7t2d0
0.0  559.0    0.0  71428.0    0.0 32.5    0.0   58.1   1 100   0   0   0   0 c7t2d0
0.0  542.0    0.0  68755.9    0.0 25.1    0.0   46.4   1 100   0   0   0   0 c7t2d0
0.0  542.0    0.0  69376.4    0.0 35.0    0.0   64.6   1 100   0   0   0   0 c7t2d0
0.0  581.0    0.0  74368.0    0.0 30.6    0.0   52.6   1 100   0   0   0   0 c7t2d0
0.0  567.0    0.0  72574.1    0.0 33.2    0.0   58.6   1 100   0   0   0   0 c7t2d0
0.0  564.0    0.0  72194.1    0.0 31.1    0.0   55.2   1 100   0   0   0   0 c7t2d0
0.0  573.0    0.0  73343.5    0.0 33.2    0.0   57.9   1 100   0   0   0   0 c7t2d0
0.0  536.3    0.0  68640.5    0.0 33.1    0.0   61.7   1 100   0   0   0   0 c7t2d0
0.0  121.9    0.0  15608.9    0.0  2.7    0.0   22.1   0  22   0   0   0   0 c7t2d0


Again, the slog wrote about double the file size (1022.6MB) and a few  
seconds later, the data was pushed to the primary storage (684.9MB  
with an expectation of 666MB = 533MB/8*10) so again about the right  
number hit the spinning platters.


Friday, 25 September 2009  1:13:43 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0  338.3    0.0  32794.4    0.0 13.7    0.0   40.6   1  47   0   0   0   0 c11t0d0
0.0  325.3    0.0  31399.8    0.0 13.7    0.0   42.0   

Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever
I thought I would try the same test using dd bs=131072 if=source  
of=/path/to/nfs to see what the results looked like…


It is very similar to before, about 2x slog usage and same timing and  
write totals.


Friday, 25 September 2009  1:49:48 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0 1538.7    0.0 196834.0    0.0 23.1    0.0   15.0   2  67   0   0   0   0 c7t2d0
0.0  562.0    0.0  71942.3    0.0 35.0    0.0   62.3   1 100   0   0   0   0 c7t2d0
0.0  590.7    0.0  75614.4    0.0 35.0    0.0   59.2   1 100   0   0   0   0 c7t2d0
0.0  600.9    0.0  76920.0    0.0 35.0    0.0   58.2   1 100   0   0   0   0 c7t2d0
0.0  546.0    0.0  69887.9    0.0 35.0    0.0   64.1   1 100   0   0   0   0 c7t2d0
0.0  554.0    0.0  70913.9    0.0 35.0    0.0   63.2   1 100   0   0   0   0 c7t2d0
0.0  598.0    0.0  76549.2    0.0 35.0    0.0   58.5   1 100   0   0   0   0 c7t2d0
0.0  563.0    0.0  72065.1    0.0 35.0    0.0   62.1   1 100   0   0   0   0 c7t2d0
0.0  588.1    0.0  75282.6    0.0 31.5    0.0   53.5   1 100   0   0   0   0 c7t2d0
0.0  564.0    0.0  72195.7    0.0 34.8    0.0   61.7   1 100   0   0   0   0 c7t2d0
0.0  582.8    0.0  74599.8    0.0 35.0    0.0   60.0   1 100   0   0   0   0 c7t2d0
0.0  544.0    0.0  69633.3    0.0 35.0    0.0   64.3   1 100   0   0   0   0 c7t2d0
0.0  530.0    0.0  67191.5    0.0 30.6    0.0   57.7   0  90   0   0   0   0 c7t2d0


And then the write to primary storage a few seconds later:

Friday, 25 September 2009  1:50:14 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0  426.3    0.0  32196.3    0.0 12.7    0.0   29.8   1  45   0   0   0   0 c11t0d0
0.0  410.4    0.0  31857.1    0.0 12.4    0.0   30.3   1  45   0   0   0   0 c11t1d0
0.0  426.3    0.0  30698.1    0.0 13.0    0.0   30.5   1  45   0   0   0   0 c11t2d0
0.0  429.3    0.0  31392.3    0.0 12.6    0.0   29.4   1  45   0   0   0   0 c11t3d0
0.0  443.2    0.0  33280.8    0.0 12.9    0.0   29.1   1  45   0   0   0   0 c11t4d0
0.0  424.3    0.0  33872.4    0.0 12.7    0.0   30.0   1  45   0   0   0   0 c11t5d0
0.0  432.3    0.0  32903.2    0.0 12.6    0.0   29.2   1  45   0   0   0   0 c11t6d0
0.0  418.3    0.0  32562.0    0.0 12.5    0.0   29.9   1  45   0   0   0   0 c11t7d0
0.0  417.3    0.0  31746.2    0.0 12.4    0.0   29.8   1  44   0   0   0   0 c11t8d0
0.0  424.3    0.0  31270.6    0.0 12.7    0.0   29.9   1  45   0   0   0   0 c11t9d0

Friday, 25 September 2009  1:50:15 PM EST
                  extended device statistics               ---- errors ---
r/s    w/s    kr/s     kw/s   wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0  434.9    0.0  37028.5    0.0 17.3    0.0   39.7   1  52   0   0   0   0 c11t0d0
1.0  436.9   64.3  37372.1    0.0 17.1    0.0   39.0   1  51   0   0   0   0 c11t1d0
1.0  442.9   64.3  38543.2    0.0 17.2    0.0   38.7   1  52   0   0   0   0 c11t2d0
1.0  436.9   64.3  37834.2    0.0 17.3    0.0   39.6   1  52   0   0   0   0 c11t3d0
1.0  412.8   64.3  35935.0    0.0 16.8    0.0   40.7   0  52   0   0   0   0 c11t4d0
1.0  413.8   64.3  35342.5    0.0 16.6    0.0   40.1   0  51   0   0   0   0 c11t5d0
2.0  418.8  128.6  36321.3    0.0 16.5    0.0   39.3   0  52   0   0   0   0 c11t6d0
1.0  425.8   64.3  36660.4    0.0 16.6    0.0   39.0   1  51   0   0   0   0 c11t7d0
1.0  437.9   64.3  37484.0    0.0 17.2    0.0   39.2   1  52   0   0   0   0 c11t8d0
0.0  437.9    0.0  37968.1    0.0 17.2    0.0   39.2   1  52   0   0   0   0 c11t9d0


So, 533MB source file, 13 seconds to write to the slog (14 before, no  
appreciable change), 1071.5MB written to the slog, 692.3MB written to  
primary storage.
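
As a sanity check on the RAIDZ2 arithmetic used in these numbers (10 disks
= 8 data + 2 parity, so on-disk traffic is roughly payload * 10/8), in
Python:

  for payload_mb in (8.4, 533):
      print(payload_mb, "MB ->", round(payload_mb * 10 / 8, 1), "MB expected")
  # 8.4 MB -> 10.5 MB expected; 533 MB -> 666.2 MB expected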


Just another data point.

cheers,
James



Re: [zfs-discuss] periodic slow responsiveness

2009-09-23 Thread James Lever


On 08/09/2009, at 2:01 AM, Ross Walker wrote:

On Sep 7, 2009, at 1:32 AM, James Lever j...@jamver.id.au wrote:

Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive  
RAIDZ2s with a hotspare... That should provide 320 IOPS instead of  
160, big difference.


The issue is interactive responsiveness and if there is a way to  
tune the system to give that while still having good performance  
for builds when they are run.


Look at the write IOPS of the pool with the zpool iostat -v and look  
at how many are happening on the RAIDZ2 vdev.


I was suggesting that slog writes were possibly starving reads from  
the l2arc as they were on the same device.  This appears not to  
have been the issue as the problem has persisted even with the  
l2arc devices removed from the pool.


The SSD will handle a lot more IOPS then the pool and L2ARC is a  
lazy reader, it mostly just holds on to read cache data.


It just may be that the pool configuration just can't handle the  
write IOPS needed and reads are starving.


Possible, but hard to tell.  Have a look at the iostat results I’ve  
posted.


The busy times of the disks while the issue is occurring should let  
you know.


So it turns out that the problem is that all writes coming via NFS are  
going through the slog.  When that happens, the transfer speed to the  
device drops to ~70MB/s (the write speed of this SLC SSD) and until the  
load drops all new write requests are blocked, causing a noticeable  
delay (which has been observed to be up to 20s, but generally only  
2-4s).


I can reproduce this behaviour by copying a large file (hundreds of MB  
in size) using 'cp src dst’ on an NFS (still currently v3) client and  
observe that all data is pushed through the slog device (10GB  
partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather  
than going direct to the primary storage disks.


On a related note, I had 2 of these devices (both using just 10GB  
partitions) connected as log devices (so the pool had 2 separate log  
devices) and the second one was consistently running significantly  
slower than the first.  Removing the second device made an improvement  
on performance, but did not remove the occasional observed pauses.


I was of the (mis)understanding that only metadata and writes smaller  
than 64k went via the slog device in the event of an O_SYNC write  
request?


The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via the  
slog device?


I have investigated using the logbias setting, but that will just kill  
small file performance also on any filesystem using it and defeat the  
purpose of having a slog device at all.


cheers,
James



Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread Ross Walker
On Sun, Sep 6, 2009 at 9:15 AM, James Lever j...@jamver.id.au wrote:
 I’m experiencing occasional slow responsiveness on an OpenSolaris b118
 system typically noticed when running an ‘ls’ (no extra flags, so no
 directory service lookups).  There is a delay of between 2 and 30 seconds
 but no correlation has been noticed with load on the server and the slow
 return.  This problem has only been noticed via NFS (v3.  We are migrating
 to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for
 snv_124).  The problem has been observed both locally on the primary
 filesystem, in a locally automounted reference (/home/foo) and remotely via
 NFS.

 zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/
 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E) with 2x SSDs
 each partitioned as 10GB slog and 36GB remainder as l2arc behind another LSI
 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

 The system is configured as an NFS (currently serving NFSv3), iSCSI
 (COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34) with
 authentication taking place from a remote openLDAP server.

 Automount is in use both locally and remotely (linux clients).  Locally
 /home/* is remounted from the zpool, remotely /home and another filesystem
 (and children) are mounted using autofs.  There was some suspicion that
 automount is the problem, but no definitive evidence as of yet.

 The problem has definitely been observed with stats (of some form, typically
 ‘/usr/bin/ls’ output) both remotely, locally in /home/* and locally in
 /zpool/home/* (the true source location).  There is a clear correlation with
 recency of reads of the directories in question and reoccurrence of the
 fault in that one user has scripted a regular (15m/30m/hourly tests so far)
 ‘ls’ of the filesystems of interest and this has reduced the fault to have
 minimal noted impact since starting down this path (just for themself).

 I have removed the l2arc(s) (cache devices) from the pool and the same
 behaviour has been observed.  My suspicion here was that there was perhaps
 occasional high synchronous load causing heavy writes to the slog devices
 and when a stat was requested it may have been faulting from ARC to L2ARC
 prior to going to the primary data store.  The slowness has been reported
 since removing the extra cache devices.

 Another thought I had was along the lines of fileystem caching and heavy
 writes causing read blocking.  I have no evidence that this is the case, but
 some suggestions on list recently of limiting the ZFS memory usage for write
 caching.  Can anybody comment to the effectiveness of this (I have 256MB
 write cache in front of the slog SSDs and 512MB in front of the primary
 storage devices).

 My DTrace is very poor but I’m suspicious that this is the best way to root
 cause this problem.  If somebody has any code that may assist in debugging
 this problem and was able to share it, it would be much appreciated.

 Any other suggestions for how to identify this fault and work around it
 would be greatly appreciated.

That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.

You have iSCSI, NFS, CIFS to choose from (most obvious), try
restarting them one at a time during down time and see if performance
improves after each restart to find the culprit.

-Ross


Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread Richard Elling

On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:


On Sun, Sep 6, 2009 at 9:15 AM, James Lever j...@jamver.id.au wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118
system typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups).  There is a delay of between 2 and 30 seconds
but no correlation has been noticed with load on the server and the slow
return.  This problem has only been noticed via NFS (v3.  We are migrating
to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated
for snv_124).  The problem has been observed both locally on the primary
filesystem, in a locally automounted reference (/home/foo) and remotely
via NFS.


I'm confused.  If "this problem has only been noticed via NFS (v3)" then
how is it observed locally?

zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI  
1078 w/
512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E) with  
2x SSDs
each partitioned as 10GB slog and 36GB remainder as l2arc behind  
another LSI

1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

The system is configured as an NFS (currently serving NFSv3), iSCSI
(COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34)  
with

authentication taking place from a remote openLDAP server.

Automount is in use both locally and remotely (linux clients).  Locally,
/home/* is remounted from the zpool; remotely, /home and another filesystem
(and children) are mounted using autofs.  There was some suspicion that
automount is the problem, but no definitive evidence as of yet.

The problem has definitely been observed with stats (of some form,
typically ‘/usr/bin/ls’ output) both remotely, locally in /home/* and
locally in /zpool/home/* (the true source location).  There is a clear
correlation with recency of reads of the directories in question and
recurrence of the fault, in that one user has scripted a regular
(15m/30m/hourly tests so far) ‘ls’ of the filesystems of interest, and this
has reduced the fault to minimal noted impact since starting down this path
(just for themselves).


iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.
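
For example (standard iostat flags; the 1-second interval is just a choice):

  # per-device latency, 1-second samples; -z suppresses idle devices
  iostat -xzn 1
  # watch asvc_t (active service time, in ms) and actv (commands in flight)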

I have removed the l2arc(s) (cache devices) from the pool and the same
behaviour has been observed.  My suspicion here was that there was perhaps
occasional high synchronous load causing heavy writes to the slog devices,
and when a stat was requested it may have been faulting from ARC to L2ARC
prior to going to the primary data store.  The slowness has been reported
since removing the extra cache devices.

Another thought I had was along the lines of filesystem caching and heavy
writes causing read blocking.  I have no evidence that this is the case, but
there have been some suggestions on the list recently about limiting the ZFS
memory usage for write caching.  Can anybody comment on the effectiveness of
this?  (I have 256MB write cache in front of the slog SSDs and 512MB in
front of the primary storage devices.)


stat(2) doesn't write, so you can stop worrying about the slog.



My DTrace is very poor, but I suspect that this is the best way to root
cause this problem.  If somebody has any code that may assist in debugging
this problem and is able to share it, it would be much appreciated.

Any other suggestions for how to identify this fault and work around it
would be greatly appreciated.


Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.
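
A quick sketch of where to look (the grep pattern is an assumption, since
counter names vary a little between releases):

  # interface input/output errors, client and server
  netstat -i
  # TCP retransmission counters
  netstat -s -P tcp | grep -i retrans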


That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.


See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with "Physical Memory
Control Using the Resource Capping Daemon" in System Administration Guide:
Solaris Containers-Resource Management, and Solaris Zones.
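
A minimal sketch of capping a project's resident set with rcapd (the
project name and the 2GB cap are assumptions, purely for illustration):

  # cap the 'user.build' project at 2GB RSS, enable the capper, then watch it
  projmod -s -K "rcap.max-rss=2GB" user.build
  rcapadm -E
  rcapstat 5
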
 -- richard


You have iSCSI, NFS, CIFS to choose from (most obvious), try
restarting them one at a time during down time and see if performance
improves after each restart to find the culprit.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever


On 07/09/2009, at 6:24 AM, Richard Elling wrote:


On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:


On Sun, Sep 6, 2009 at 9:15 AM, James Lever j...@jamver.id.au wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118
system, typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups).  There is a delay of between 2 and 30 seconds,
but no correlation has been noticed between load on the server and the slow
return.  This problem has only been noticed via NFS (v3; we are migrating
to NFSv4 once the O_EXCL/mtime bug fix has been integrated, anticipated for
snv_124).  The problem has been observed both locally on the primary
filesystem, in a locally automounted reference (/home/foo), and remotely via
NFS.


I'm confused.  If “this problem has only been noticed via NFS (v3)”, then
how is it observed locally?


Sorry, I meant to say it had not been noticed using CIFS or iSCSI.

It has been observed in client:/home/user (NFSv3 automount from  
server:/home/user, redirected to server:/zpool/home/user) and also in  
server:/home/user (local automount) and server:/zpool/home/user  
(origin).



iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.


What specifically should I be looking for here? (using ‘iostat -xen -T d’)
And I’m guessing I’ll require a high level of granularity (1s intervals) to
see the issue if it is a single disk or similar.



stat(2) doesn't write, so you can stop worrying about the slog.


My concern here was that I may have been trying to write (via other
concurrent processes) at the same time as there was a memory fault from the
ARC to L2ARC.



Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.


No errors or collisions from either server or clients observed.


That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.


See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with "Physical Memory
Control Using the Resource Capping Daemon" in System Administration Guide:
Solaris Containers-Resource Management, and Solaris Zones.


Thanks Richard, I’ll have a look at that today and see where I get.

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread Ross Walker


Sorry for my earlier post, I responded prematurely.


On Sep 6, 2009, at 9:15 AM, James Lever j...@jamver.id.au wrote:

I’m experiencing occasional slow responsiveness on an OpenSolaris b118
system, typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups).  There is a delay of between 2 and 30 seconds,
but no correlation has been noticed between load on the server and the slow
return.  This problem has only been noticed via NFS (v3; we are migrating
to NFSv4 once the O_EXCL/mtime bug fix has been integrated, anticipated for
snv_124).  The problem has been observed both locally on the primary
filesystem, in a locally automounted reference (/home/foo), and remotely via
NFS.


Have you tried snoop/tcpdump/wireshark on the client side and server side
to figure out what is being sent and exactly how long it is taking to get a
response?


zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/
512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E) with 2x SSDs
each partitioned as 10GB slog and 36GB remainder as l2arc behind another
LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).


This config might lead to heavy sync writes (NFS) starving reads, due to
the fact that the whole RAIDZ2 behaves as a single disk on writes.  How
about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?


Just one or two other vdevs to spread the load can make a world of
difference.


The system is configured as an NFS (currently serving NFSv3), iSCSI
(COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34) server,
with authentication taking place from a remote openLDAP server.


There are a lot of services here, all off one pool? You might be
trying to bite off more than the config can chew.


Automount is in use both locally and remotely (linux clients).  Locally,
/home/* is remounted from the zpool; remotely, /home and another filesystem
(and children) are mounted using autofs.  There was some suspicion that
automount is the problem, but no definitive evidence as of yet.


Try taking a particularly bad problem station and configuring it  
static for a bit to see if it is.


The problem has definitely been observed with stats (of some form,
typically ‘/usr/bin/ls’ output) both remotely, locally in /home/* and
locally in /zpool/home/* (the true source location).  There is a clear
correlation with recency of reads of the directories in question and
recurrence of the fault, in that one user has scripted a regular
(15m/30m/hourly tests so far) ‘ls’ of the filesystems of interest, and this
has reduced the fault to minimal noted impact since starting down this path
(just for themselves).


Sounds like the user is pre-fetching his attribute cache to overcome
poor performance.


I have removed the l2arc(s) (cache devices) from the pool and the same
behaviour has been observed.  My suspicion here was that there was perhaps
occasional high synchronous load causing heavy writes to the slog devices,
and when a stat was requested it may have been faulting from ARC to L2ARC
prior to going to the primary data store.  The slowness has been reported
since removing the extra cache devices.


That doesn't make a lot of sense to me; the L2ARC is a secondary read
cache.  If writes are starving reads then the L2ARC would only help here.


Another thought I had was along the lines of filesystem caching and heavy
writes causing read blocking.  I have no evidence that this is the case, but
there have been some suggestions on the list recently about limiting the ZFS
memory usage for write caching.  Can anybody comment on the effectiveness of
this?  (I have 256MB write cache in front of the slog SSDs and 512MB in
front of the primary storage devices.)


It may just be that the pool configuration can't handle the write IOPS
needed and reads are starving.


My DTrace is very poor, but I suspect that this is the best way to root
cause this problem.  If somebody has any code that may assist in debugging
this problem and is able to share it, it would be much appreciated.


DTrace would tell you, but I wish the learning curve weren't so steep
to get it going.
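
As a hedged sketch of the kind of one-liner that helps here (probing
zfs_getattr is an assumption about where the time is going, and the 100ms
threshold is arbitrary):

  # report any zfs_getattr call taking longer than 100ms
  dtrace -n '
    fbt:zfs:zfs_getattr:entry  { self->ts = timestamp; }
    fbt:zfs:zfs_getattr:return /self->ts && timestamp - self->ts > 100000000/
      { printf("%d ms\n", (timestamp - self->ts) / 1000000); }
    fbt:zfs:zfs_getattr:return { self->ts = 0; }'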


Any other suggestions for how to identify this fault and work around  
it would be greatly appreciated.


I hope I gave some good pointers. I'd first look at the pool  
configuration.


-Ross


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread Richard Elling

On Sep 6, 2009, at 5:06 PM, James Lever wrote:

On 07/09/2009, at 6:24 AM, Richard Elling wrote:

On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
On Sun, Sep 6, 2009 at 9:15 AM, James Lever j...@jamver.id.au wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118
system, typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups).  There is a delay of between 2 and 30 seconds,
but no correlation has been noticed between load on the server and the slow
return.  This problem has only been noticed via NFS (v3; we are migrating
to NFSv4 once the O_EXCL/mtime bug fix has been integrated, anticipated for
snv_124).  The problem has been observed both locally on the primary
filesystem, in a locally automounted reference (/home/foo), and remotely via
NFS.


I'm confused.  If “this problem has only been noticed via NFS (v3)”, then
how is it observed locally?


Sorry, I meant to say it had not been noticed using CIFS or iSCSI.


It has been observed in client:/home/user (NFSv3 automount from  
server:/home/user, redirected to server:/zpool/home/user) and also  
in server:/home/user (local automount) and server:/zpool/home/user  
(origin).


Ok, just so I am clear: when you say local automount, you are on the
server and using the loopback -- no NFS or network involved?


iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.


What specifically should I be looking for here? (using ‘iostat -xen -T d’)
And I’m guessing I’ll require a high level of granularity (1s intervals) to
see the issue if it is a single disk or similar.


You are looking for I/O that takes seconds to complete or is stuck in
the device.  This shows up in the actv column (stuck > 1) and in asvc_t
(> 1000).


stat(2) doesn't write, so you can stop worrying about the slog.


My concern here was that I may have been trying to write (via other
concurrent processes) at the same time as there was a memory fault from the
ARC to L2ARC.


stat(2) looks at metadata, which is generally small and compressed.
It is also cached in the ARC, by default.  If this is repeatable in a short
period of time, then it is not an I/O problem and you need to look at:
1. the number of files in the directory
2. the locale (ls sorts by default, and your locale affects the sort time)
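
A quick way to separate the two (ptime for timing; ls -f lists in directory
order with no sort; the path is whichever directory shows the delay):

  # sorted under the current locale vs. unsorted
  ptime ls /zpool/home/user > /dev/null
  ptime ls -f /zpool/home/user > /dev/null
  # or force the C locale's cheap byte-wise sort
  LC_ALL=C ptime ls /zpool/home/user > /dev/null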



Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.


No errors or collisions from either server or clients observed.


retrans?
As Ross mentioned, wireshark, snoop, or most other network monitors
will show network traffic in detail.
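
For instance, a capture narrowed to NFS traffic from a single client
(interface name and client hostname are assumptions):

  # capture on the server, NFS port only, for later analysis in wireshark
  snoop -d e1000g0 -o /tmp/nfs.snoop host client1 port 2049
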
 -- richard


That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.


See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with "Physical Memory
Control Using the Resource Capping Daemon" in System Administration Guide:
Solaris Containers-Resource Management, and Solaris Zones.


Thanks Richard, I’ll have a look at that today and see where I get.

cheers,
James



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever


On 07/09/2009, at 11:08 AM, Richard Elling wrote:


Ok, just so I am clear, when you mean local automount you are
on the server and using the loopback -- no NFS or network involved?


Correct.  And the behaviour has been seen locally as well as remotely.


You are looking for I/O that takes seconds to complete or is stuck in
the device.  This shows up in the actv column (stuck > 1) and in asvc_t
(> 1000).


Just started having some slow responsiveness reported from a user using
emacs (autosave, start of a build), so a small file write request.


The second or so before they went to do this, it appears as if the raid
cache in front of the slog devices was nearly filled and the SSDs were
being utilised quite heavily, but then there was a break where I see
relatively light usage on the slog yet the device is reported as 100% busy.


The iostat output is at the end of this message - I can’t make any real
sense out of why a user would have seen a ~4s delay at about 2:39:17-18.
Only one of the two slog devices is being used at all.  Is there some
tunable for how multiple slogs are used?


c7t[01] are rpool
c7t[23] are slog devices in the data pool
c11t* are the primary storage devices for the data pool
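
(For a per-vdev view of slog activity to complement the device-level iostat
below, a sketch with the pool name assumed:)

  # per-vdev IOPS/bandwidth, including each log device, 1-second samples
  zpool iostat -v datapool 1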

cheers,
James

Monday,  7 September 2009  2:39:17 PM EST
                    extended device statistics                  ---- errors ---
    r/s    w/s   kr/s      kw/s  wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0 1475.0    0.0  188799.0   0.0  30.2    0.0   20.5   2  90   0   0   0   0 c7t2d0
    0.0  232.0    0.0   29571.8   0.0  33.8    0.0  145.9   0  98   0   0   0   0 c7t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0

Monday,  7 September 2009  2:39:18 PM EST
                    extended device statistics                  ---- errors ---
    r/s    w/s   kr/s      kw/s  wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0    0.0    0.0       0.0   0.0  35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0

Monday,  7 September 2009  2:39:19 PM EST
                    extended device statistics                  ---- errors ---
    r/s    w/s   kr/s      kw/s  wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  341.0    0.0   43650.1   0.0  35.0    0.0  102.5   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0       0.0   0.0   0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0

Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever


On 07/09/2009, at 10:46 AM, Ross Walker wrote:

zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/
512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E) with 2x SSDs
each partitioned as 10GB slog and 36GB remainder as l2arc behind another
LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).


This config might lead to heavy sync writes (NFS) starving reads, due to
the fact that the whole RAIDZ2 behaves as a single disk on writes.  How
about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?


Just one or two other vdevs to spread the load can make a world of
difference.


This was a management decision.  I wanted to go down the striped mirrored
pair path, but the amount of space lost was considered too great.  RAIDZ2
was considered the best value option for our environment.


The system is configured as an NFS (currently serving NFSv3), iSCSI  
(COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34)  
with authentication taking place from a remote openLDAP server.


There are a lot of services here, all off one pool? You might be
trying to bite off more than the config can chew.


That’s not a lot of services, really.  We have 6 users doing builds on  
multiple platforms and using the storage as their home directory  
(windows and unix).


The issue is interactive responsiveness, and whether there is a way to tune
the system to give that while still having good performance for builds when
they are run.


Try taking a particularly bad problem station and configuring it  
static for a bit to see if it is.


That has been considered also, but the issue has also been observed  
locally on the fileserver.


That doesn't make a lot of sense to me; the L2ARC is a secondary read
cache.  If writes are starving reads then the L2ARC would only help here.


I was suggesting that slog writes were possibly starving reads from the
l2arc, as they were on the same device.  This appears not to have been the
issue, as the problem has persisted even with the l2arc devices removed
from the pool.


It may just be that the pool configuration can't handle the write IOPS
needed and reads are starving.


Possible, but hard to tell.  Have a look at the iostat results I’ve  
posted.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss