Re: [zfs-discuss] ZFS upgrade.

2010-01-07 Thread James Lever
Hi John,

On 08/01/2010, at 7:19 AM, john_dil...@blm.gov wrote:

 Is there a way to upgrade my current ZFS version.  I show the version could
 be as high as 22.

The version of Solaris you are running only supports ZFS versions up to version 
15, as demonstrated by your zfs upgrade -v output. You probably need a newer 
version of Solaris, but I cannot tell you whether any newer release supports 
later zfs versions.
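
For reference, the commands below are roughly how you check and, on a release 
that supports it, upgrade the on-disk versions (the pool name "tank" is just an 
example):

  zpool upgrade -v      (lists the pool versions this release supports)
  zfs upgrade -v        (lists the filesystem versions this release supports)
  zpool upgrade tank    (upgrades the pool to the newest supported version)
  zfs upgrade -r tank   (upgrades the filesystems beneath tank)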

This forum is for OpenSolaris support.  You should contact your Solaris support 
provider for further help on this matter.

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to destroy your system in funny way with ZFS

2009-12-27 Thread James Lever
Hi Tomas,

On 27/12/2009, at 7:25 PM, Tomas Bodzar wrote:

 pfexec zpool set dedup=verify rpool
 pfexec zfs set compression=gzip-9 rpool
 pfexec zfs set devices=off rpool/export/home
 pfexec zfs set exec=off rpool/export/home
 pfexec zfs set setuid=off rpool/export/home

grub doesn’t support gzip, so you will need to unset that and hope that the 
system can still boot with what has already been written to disk.  It is 
possible you will need to back up and reinstall.

I learnt this one the hard way - don’t use gzip compression on the root of your 
rpool (you can on child filesystems that are not involved in the boot process, 
though).
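
If it does still boot, something along these lines should put it right 
(assuming the rpool/export/home layout from your commands above - adjust to 
taste):

  pfexec zfs inherit compression rpool
  pfexec zfs set compression=gzip-9 rpool/export/home

so the boot path falls back to the default while the home filesystems keep 
gzip-9.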

HTH,
James
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread James Lever

On 10/12/2009, at 5:36 AM, Adam Leventhal wrote:

 The dedup property applies to all writes so the settings for the pool of 
 origin don't matter, just those on the destination pool.

Just a quick related question I’ve not seen answered anywhere else:

Is it safe to have dedup running on your rpool? (at install time, or if you 
need to migrate your rpool to new media)

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ZIL/log on SSD weirdness

2009-11-17 Thread James Lever

On 18/11/2009, at 7:33 AM, Dushyanth wrote:

 Now when I run dd and create a big file on /iftraid0/fs and watch `iostat 
 -xnz 2` I don't see any stats for c8t4d0, nor does the write performance 
 improve.
 
 I have not formatted either c9t9d0 or c8t4d0. What am I missing?

Last I checked, iSCSI volumes go direct to the primary storage and not via the 
slog device.

Can anybody confirm that this is the case, whether there is a mechanism or 
tunable to force it via the slog, and whether there is any benefit in doing so 
for most cases?

cheers,
James
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread James Lever


On 03/11/2009, at 7:32 AM, Daniel Streicher wrote:

But how can I update my current OpenSolaris (2009.06) or Solaris  
10 (5/09) to use this?

Or do I have to wait for a new stable release of Solaris 10 / OpenSolaris?


For OpenSolaris, you change your repository and switch to the  
development branches - it should be available to the public in about 3-3.5  
weeks' time.  There are plenty of instructions on how to do this on the net and  
in this list.
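
From memory it is something along these lines (the dev repository URL here is 
from memory, so check it before relying on it):

  pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  pfexec pkg image-update

then reboot into the new boot environment when it finishes.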


For Solaris, you need to wait for the next update release.

cheers,
James


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 10 samba in AD mode broken when user in 32 AD groups

2009-10-13 Thread James Lever


On 14/10/2009, at 2:27 AM, casper@sun.com wrote:


So why not the built-in CIFS support in OpenSolaris?  Probably has a
similar issue, but still.


In my case, there are at least two reasons:

 * Crossing mountpoints requires separate shares - Samba can share an  
entire hierarchy regardless of the ZFS filesystems beneath the sharepoint.


 * LDAP integration - the in-kernel CIFS only supports real AD (LDAP+krb5)  
for directory binding; otherwise all users must have separately managed  
local system accounts.


Until these features are available via the in-kernel CIFS  
implementation, I’m forced to stick with Samba for our CIFS needs.
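
To illustrate the first point, with the in-kernel server every filesystem has 
to be shared individually, along the lines of (the share names here are only 
examples):

  pfexec zfs set sharesmb=name=home rpool/export/home
  pfexec zfs set sharesmb=name=jlever rpool/export/home/jlever

whereas Samba can export the whole tree with a single share definition in 
smb.conf.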


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread James Lever


On 26/09/2009, at 1:14 AM, Ross Walker wrote:


By any chance do you have copies=2 set?


No, only 1.  So the doubling of data going to the slog (as reported by  
iostat) is still confusing me and is potentially causing significant harm  
to my performance.



Also, try setting zfs_write_limit_override equal to the size of the
NVRAM cache (or half depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw


That’s an interesting concept.  All data still appears to go via the  
slog device; however, under heavy load my response time for a new write is  
typically below 2s (a few outliers at about 3.5s) and a read  
(directory listing of a non-cached entry) is about 2s.


What will this do once it hits the limit?  Will streaming writes now  
be sent directly to a txg and streamed to the primary storage  
devices?  (that is what I would like to see happen).



As a side an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much
higher throughput then an SSD. An example of a workload that benefits
from an slog device is ESX over NFS, which does a COMMIT for each
block written, so it benefits from an slog, but a standard media
server will not (but an L2ARC would be beneficial).

Better workload analysis is really what it is about.



It seems that it doesn’t matter what the workload is if the NFS pipe  
can sustain more continuous throughput than the slog chain can support.


I suppose some creative use of the logbias setting might assist this  
situation and force all potentially heavy writers directly to the  
primary storage.  This would, however, negate any benefit of having a  
fast, low-latency device for those filesystems at the times when it  
is desirable (any large batch of small writes, for example).


Is there a way to have a dynamic, automatic logbias-type setting depending  
on the transaction currently presented to the server, such that if it  
is clearly a large streaming write it gets treated as  
logbias=throughput, and if it is a small transaction it gets treated as  
logbias=latency?  (i.e. so that NFS transactions can effectively be  
treated as if they were local storage, while only slightly breaking the  
benefits of the txg scheduling).
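
For what it's worth, the static version of this is trivial to set per 
filesystem - it is only the automatic switching that is missing (the dataset 
names here are made up):

  pfexec zfs set logbias=throughput tank/builds    (large streaming writers bypass the slog)
  pfexec zfs set logbias=latency tank/home         (small sync writes keep using the slog; the default)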


On 26/09/2009, at 3:39 AM, Richard Elling wrote:


Back of the envelope math says:
10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
1 GByte/sec * 30 sec = 30 GBytes

Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
or so.


At this point, enter the fusionIO cards or similar devices.   
Unfortunately there does not seem to be anything on the market with  
infinitely fast write capacity (memory speeds) that is also supported  
under OpenSolaris as a slog device.


I think this is precisely what I (and anybody running a general  
purpose NFS server) need for a general purpose slog device.



Both of the above assume there is lots of memory in the server.
This is increasingly becoming easier to do as the memory costs
come down and you can physically fit 512 GBytes in a 4u server.
By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
NVRAM in their arrays. So Bob's recommendation of reducing the
txg commit interval below 30 seconds also has merit.  Or, to put it
another way, the dynamic sizing of the txg commit interval isn't
quite perfect yet. [Cue for Neil to chime in... :-)]


How does reducing the txg commit interval really help?  Will data no  
longer go via the slog once it is streaming to disk?  Or will all data  
still be pushed through the slog regardless?


For a predominantly NFS server, it really looks as if the slog has to  
outperform your main pool in continuous write speed, as well as offering an  
instant response time, as the primary criterion; which might as well be a  
fast SSD (or group of fast SSDs) or 15kRPM drives with some NVRAM in front  
of them.


Is there also a way to throttle synchronous writes to the slog  
device?  Much like the ZFS write throttling that is already  
implemented, so that there is a gap for new writers to enter when  
writing to the slog device? (or is this the norm and includes slog  
writes?)


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever


On 25/09/2009, at 2:58 AM, Richard Elling wrote:


On Sep 23, 2009, at 10:00 PM, James Lever wrote:

So it turns out that the problem is that all writes coming via NFS  
are going through the slog.  When that happens, the transfer speed  
to the device drops to ~70MB/s (the write speed of this SLC SSD) and,  
until the load drops, all new write requests are blocked, causing a  
noticeable delay (which has been observed to be up to 20s, but  
generally only 2-4s).


Thank you sir, can I have another?
If you add (not attach) more slogs, the workload will be spread  
across them.  But...


My log configuration is:

logs
  c7t2d0s0   ONLINE   0 0 0
  c7t3d0s0   OFFLINE  0 0 0

I’m going to test the now-removed SSD and see if I can get it to  
perform significantly worse than the first one, but my memory from  
pre-production testing was that they were both equally slow and not  
significantly different.


On a related note, I had 2 of these devices (both using just 10GB  
partitions) connected as log devices (so the pool had 2 separate  
log devices) and the second one was consistently running  
significantly slower than the first.  Removing the second device  
made an improvement on performance, but did not remove the  
occasional observed pauses.


...this is not surprising, when you add a slow slog device.  This is  
the weakest link rule.


So, in theory, even if one of the two SSDs was only slightly slower  
than the other, it would just appear to be more heavily affected?


Here is part of what I’m not understanding - unless one SSD is  
significantly worse than the other, how can the following scenario be  
true?  Here is some iostat output from the two slog devices at 1s  
intervals when it gets a large series of write requests.


Idle at start.

    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0 1462.0    0.0 187010.2  0.0 28.6    0.0   19.6   2  83   0   0   0   0 c7t2d0
    0.0  233.0    0.0  29823.7  0.0 28.7    0.0  123.3   0  83   0   0   0   0 c7t3d0

NVRAM cache close to full. (256MB BBC)

    0.0   84.0    0.0  10622.0  0.0  3.5    0.0   41.2   0  12   0   0   0   0 c7t2d0
    0.0    0.0    0.0      0.0  0.0 35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  305.0    0.0  39039.3  0.0 35.0    0.0  114.7   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  361.0    0.0  46208.1  0.0 35.0    0.0   96.8   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  329.0    0.0  42114.0  0.0 35.0    0.0  106.3   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  317.0    0.0  40449.6  0.0 27.4    0.0   86.5   0  85   0   0   0   0 c7t3d0

    0.0    4.0    0.0    263.8  0.0  0.0    0.0    0.2   0   0   0   0   0   0 c7t2d0
    0.0    4.0    0.0    367.8  0.0  0.0    0.0    0.3   0   0   0   0   0   0 c7t3d0


What determines the size of the writes or the distribution between slog  
devices?  It looks like ZFS decided to send a large chunk to one slog,  
which nearly filled the NVRAM, and then continued writing to the other  
one, which meant that it had to go at device speed (whatever that is  
for the data size/write size).  Is there a way to tune the writes to  
multiple slogs to be (for argument's sake) 10MB slices?


I was of the (mis)understanding that only metadata and writes  
smaller than 64k went via the slog device in the event of an O_SYNC  
write request?


The threshold is 32 kBytes, which is unfortunately the same as the default  
NFS write size. See CR 6686887:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887

If you have a slog and logbias=latency (default) then the writes go  
to the slog.
So there is some interaction here that can affect NFS workloads in  
particular.


Interesting CR.

nfsstat -m output on one of the Linux hosts (Ubuntu):

 Flags: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.0.17,mountvers=3,mountproto=tcp,addr=10.1.0.17


rsize and wsize are auto-tuned to 1MB.  How does this affect the sync  
request threshold?
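
If it helps anyone reproduce the interaction, forcing the client write size 
back under the 32k threshold is a one-line change on the Linux side (a test of 
the interaction, not a recommendation; server and paths are examples):

  mount -o vers=3,rsize=32768,wsize=32768 server:/export/home /mnt/home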



The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via  
the slog device?


You can change the IOP size on the client.



You’re suggesting modifying rsize/wsize?  Or something else?

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever


On 25/09/2009, at 1:24 AM, Bob Friesenhahn wrote:


On Thu, 24 Sep 2009, James Lever wrote:


Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via  
the slog device?


Synchronous writes are needed by NFS to support its atomic write  
requirement.  It sounds like your SSD is write-bandwidth  
bottlenecked rather than IOPS bottlenecked.  Replacing your SSD with  
a more performant one seems like the first step.


NFS client tunings can make a big difference when it comes to  
performance.  Check the nfs(5) manual page for your Linux systems to  
see what options are available.  An obvious tunable is 'wsize' which  
should ideally match (or be a multiple of) the zfs filesystem block  
size.  The /proc/mounts file for my Debian install shows that  
1048576 is being used.  This is quite large and perhaps a smaller  
value would help.  If you are willing to accept the risk, using the  
Linux 'async' mount option may make things seem better.


From the Linux NFS FAQ (http://nfs.sourceforge.net/):

“NFS Version 3 introduces the concept of safe asynchronous writes.”

And it continues.

My rsize and wsize are negotiating to 1MB.

James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever


On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:

The commentary says that normally the COMMIT operations occur during  
close(2) or fsync(2) system call, or when encountering memory  
pressure.  If the problem is slow copying of many small files, this  
COMMIT approach does not help very much since very little data is  
sent per file and most time is spent creating directories and files.


The problem appears to be slog bandwidth exhaustion: all data is being sent  
via the slog, creating contention for all subsequent NFS or locally  
synchronous writes.  The NFS writes do not appear to be synchronous in  
nature - there is only a COMMIT issued at the very end - however, all of  
that data appears to be going via the slog, and it appears to be inflated  
to twice its original size.


For a test, I just copied a relatively small file (8.4MB in size).   
Looking at a tcpdump analysis using Wireshark, there is a SETATTR, the  
transfer ends with a V3 COMMIT, and there are no COMMIT messages during the  
transfer.
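
For anyone wanting to repeat the exercise, the capture needs nothing fancier 
than something like this (the filename is arbitrary):

  tcpdump -s 0 -w nfs-copy.pcap port 2049

run on the client during the copy, with the resulting file then opened in 
Wireshark to inspect the NFS operations.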


iostat output that matches looks like this:

slog write of the data (17MB appears to hit the slog)

Friday, 25 September 2009  1:01:00 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  135.0    0.0  17154.5  0.0  0.8    0.0    6.0   0   3   0   0   0   0 c7t2d0


Then a few seconds later, the transaction group gets flushed to  
primary storage, writing nearly 11.4MB, which is in line with RAID-Z2  
(expected around 10.5MB: 8.4/8*10):


Friday, 25 September 2009  1:01:13 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0   91.0    0.0   1170.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t0d0
    0.0   84.0    0.0   1171.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t1d0
    0.0   92.0    0.0   1172.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t2d0
    0.0   84.0    0.0   1172.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t3d0
    0.0   81.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t4d0
    0.0   86.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t5d0
    0.0   89.0    0.0   1175.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t6d0
    0.0   84.0    0.0   1175.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t7d0
    0.0   91.0    0.0   1168.9  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t8d0
    0.0   89.0    0.0   1170.9  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t9d0


So I performed the same test with a much larger file (533MB), larger than  
the NVRAM cache in front of the SSD, to see what it would do.  Note that  
after the second second of activity the NVRAM is full and only admits data  
at about the sequential write speed of the SSD (~70MB/s).


Friday, 25 September 2009  1:13:14 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  640.9    0.0  81782.9  0.0  4.2    0.0    6.5   1  14   0   0   0   0 c7t2d0
    0.0 1065.7    0.0 136408.1  0.0 18.6    0.0   17.5   1  78   0   0   0   0 c7t2d0
    0.0  579.0    0.0  74113.3  0.0 30.7    0.0   53.1   1 100   0   0   0   0 c7t2d0
    0.0  588.7    0.0  75357.0  0.0 33.2    0.0   56.3   1 100   0   0   0   0 c7t2d0
    0.0  532.0    0.0  68096.3  0.0 31.5    0.0   59.1   1 100   0   0   0   0 c7t2d0
    0.0  559.0    0.0  71428.0  0.0 32.5    0.0   58.1   1 100   0   0   0   0 c7t2d0
    0.0  542.0    0.0  68755.9  0.0 25.1    0.0   46.4   1 100   0   0   0   0 c7t2d0
    0.0  542.0    0.0  69376.4  0.0 35.0    0.0   64.6   1 100   0   0   0   0 c7t2d0
    0.0  581.0    0.0  74368.0  0.0 30.6    0.0   52.6   1 100   0   0   0   0 c7t2d0
    0.0  567.0    0.0  72574.1  0.0 33.2    0.0   58.6   1 100   0   0   0   0 c7t2d0
    0.0  564.0    0.0  72194.1  0.0 31.1    0.0   55.2   1 100   0   0   0   0 c7t2d0
    0.0  573.0    0.0  73343.5  0.0 33.2    0.0   57.9   1 100   0   0   0   0 c7t2d0
    0.0  536.3    0.0  68640.5  0.0 33.1    0.0   61.7   1 100   0   0   0   0 c7t2d0
    0.0  121.9    0.0  15608.9  0.0  2.7    0.0   22.1   0  22   0   0   0   0 c7t2d0


Again, the slog wrote about double the file size (1022.6MB) and a few  
seconds later the data was pushed to the primary storage (684.9MB  
against an expectation of 666MB = 533MB/8*10), so again roughly the right  
amount hit the spinning platters.


Friday, 25 September 2009  1:13:43 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  338.3    0.0  32794.4  0.0 13.7    0.0   40.6   1  47   0   0   0   0 c11t0d0
    0.0  325.3    0.0  31399.8  0.0 13.7    0.0   42.0

Re: [zfs-discuss] periodic slow responsiveness

2009-09-24 Thread James Lever
I thought I would try the same test using dd bs=131072 if=source  
of=/path/to/nfs to see what the results looked like…


It is very similar to before, about 2x slog usage and same timing and  
write totals.


Friday, 25 September 2009  1:49:48 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0 1538.7    0.0 196834.0  0.0 23.1    0.0   15.0   2  67   0   0   0   0 c7t2d0
    0.0  562.0    0.0  71942.3  0.0 35.0    0.0   62.3   1 100   0   0   0   0 c7t2d0
    0.0  590.7    0.0  75614.4  0.0 35.0    0.0   59.2   1 100   0   0   0   0 c7t2d0
    0.0  600.9    0.0  76920.0  0.0 35.0    0.0   58.2   1 100   0   0   0   0 c7t2d0
    0.0  546.0    0.0  69887.9  0.0 35.0    0.0   64.1   1 100   0   0   0   0 c7t2d0
    0.0  554.0    0.0  70913.9  0.0 35.0    0.0   63.2   1 100   0   0   0   0 c7t2d0
    0.0  598.0    0.0  76549.2  0.0 35.0    0.0   58.5   1 100   0   0   0   0 c7t2d0
    0.0  563.0    0.0  72065.1  0.0 35.0    0.0   62.1   1 100   0   0   0   0 c7t2d0
    0.0  588.1    0.0  75282.6  0.0 31.5    0.0   53.5   1 100   0   0   0   0 c7t2d0
    0.0  564.0    0.0  72195.7  0.0 34.8    0.0   61.7   1 100   0   0   0   0 c7t2d0
    0.0  582.8    0.0  74599.8  0.0 35.0    0.0   60.0   1 100   0   0   0   0 c7t2d0
    0.0  544.0    0.0  69633.3  0.0 35.0    0.0   64.3   1 100   0   0   0   0 c7t2d0
    0.0  530.0    0.0  67191.5  0.0 30.6    0.0   57.7   0  90   0   0   0   0 c7t2d0


And then the write to primary storage a few seconds later:

Friday, 25 September 2009  1:50:14 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  426.3    0.0  32196.3  0.0 12.7    0.0   29.8   1  45   0   0   0   0 c11t0d0
    0.0  410.4    0.0  31857.1  0.0 12.4    0.0   30.3   1  45   0   0   0   0 c11t1d0
    0.0  426.3    0.0  30698.1  0.0 13.0    0.0   30.5   1  45   0   0   0   0 c11t2d0
    0.0  429.3    0.0  31392.3  0.0 12.6    0.0   29.4   1  45   0   0   0   0 c11t3d0
    0.0  443.2    0.0  33280.8  0.0 12.9    0.0   29.1   1  45   0   0   0   0 c11t4d0
    0.0  424.3    0.0  33872.4  0.0 12.7    0.0   30.0   1  45   0   0   0   0 c11t5d0
    0.0  432.3    0.0  32903.2  0.0 12.6    0.0   29.2   1  45   0   0   0   0 c11t6d0
    0.0  418.3    0.0  32562.0  0.0 12.5    0.0   29.9   1  45   0   0   0   0 c11t7d0
    0.0  417.3    0.0  31746.2  0.0 12.4    0.0   29.8   1  44   0   0   0   0 c11t8d0
    0.0  424.3    0.0  31270.6  0.0 12.7    0.0   29.9   1  45   0   0   0   0 c11t9d0

Friday, 25 September 2009  1:50:15 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  434.9    0.0  37028.5  0.0 17.3    0.0   39.7   1  52   0   0   0   0 c11t0d0
    1.0  436.9   64.3  37372.1  0.0 17.1    0.0   39.0   1  51   0   0   0   0 c11t1d0
    1.0  442.9   64.3  38543.2  0.0 17.2    0.0   38.7   1  52   0   0   0   0 c11t2d0
    1.0  436.9   64.3  37834.2  0.0 17.3    0.0   39.6   1  52   0   0   0   0 c11t3d0
    1.0  412.8   64.3  35935.0  0.0 16.8    0.0   40.7   0  52   0   0   0   0 c11t4d0
    1.0  413.8   64.3  35342.5  0.0 16.6    0.0   40.1   0  51   0   0   0   0 c11t5d0
    2.0  418.8  128.6  36321.3  0.0 16.5    0.0   39.3   0  52   0   0   0   0 c11t6d0
    1.0  425.8   64.3  36660.4  0.0 16.6    0.0   39.0   1  51   0   0   0   0 c11t7d0
    1.0  437.9   64.3  37484.0  0.0 17.2    0.0   39.2   1  52   0   0   0   0 c11t8d0
    0.0  437.9    0.0  37968.1  0.0 17.2    0.0   39.2   1  52   0   0   0   0 c11t9d0


So, 533MB source file, 13 seconds to write to the slog (14 before, no  
appreciable change), 1071.5MB written to the slog, 692.3MB written to  
primary storage.


Just another data point.

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-23 Thread James Lever


On 08/09/2009, at 2:01 AM, Ross Walker wrote:

On Sep 7, 2009, at 1:32 AM, James Lever j...@jamver.id.au wrote:

Well, an MD1000 holds 15 drives; a good compromise might be 2 x 7-drive  
RAIDZ2s with a hot spare... That should provide 320 IOPS instead of  
160, a big difference.


The issue is interactive responsiveness and if there is a way to  
tune the system to give that while still having good performance  
for builds when they are run.


Look at the write IOPS of the pool with the zpool iostat -v and look  
at how many are happening on the RAIDZ2 vdev.


I was suggesting that slog writes were possibly starving reads from  
the l2arc as they were on the same device.  This appears not to  
have been the issue, as the problem has persisted even with the  
l2arc devices removed from the pool.


The SSD will handle a lot more IOPS then the pool and L2ARC is a  
lazy reader, it mostly just holds on to read cache data.


It just may be that the pool configuration just can't handle the  
write IOPS needed and reads are starving.


Possible, but hard to tell.  Have a look at the iostat results I’ve  
posted.


The busy times of the disks while the issue is occurring should let  
you know.


So it turns out that the problem is that all writes coming via NFS are  
going through the slog.  When that happens, the transfer speed to the  
device drops to ~70MB/s (the write speed of this SLC SSD) and, until the  
load drops, all new write requests are blocked, causing a noticeable  
delay (which has been observed to be up to 20s, but generally only  
2-4s).


I can reproduce this behaviour by copying a large file (hundreds of MB  
in size) using 'cp src dst’ on an NFS (still currently v3) client and  
observing that all data is pushed through the slog device (10GB  
partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather  
than going direct to the primary storage disks.


On a related note, I had 2 of these devices (both using just 10GB  
partitions) connected as log devices (so the pool had 2 separate log  
devices) and the second one was consistently running significantly  
slower than the first.  Removing the second device made an improvement  
on performance, but did not remove the occasional observed pauses.


I was of the (mis)understanding that only metadata and writes smaller  
than 64k went via the slog device in the event of an O_SYNC write  
request?


The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that  
when I perform a large synchronous write, the data does not go via the  
slog device?


I have investigated using the logbias setting, but that would also kill  
small-file performance on any filesystem using it and defeat the  
purpose of having a slog device at all.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever
I’m experiencing occasional slow responsiveness on an OpenSolaris b118  
system, typically noticed when running an ‘ls’ (no extra flags, so no  
directory service lookups).  There is a delay of between 2 and 30  
seconds, but no correlation has been noticed between load on the server  
and the slow return.  This problem has only been noticed via NFS (v3;  
we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been  
integrated - anticipated for snv_124).  The problem has been observed  
both locally on the primary filesystem, in a locally automounted  
reference (/home/foo), and remotely via NFS.


The zpool is RAIDZ2, comprised of 10 x 15kRPM SAS drives behind an LSI 1078  
w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind a PERC 6/E), with  
2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc  
behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).


The system is configured as an NFS (currently serving NFSv3), iSCSI  
(COMSTAR) and CIFS (using the Sun SFW package running Samba 3.0.34)  
server, with authentication taking place against a remote OpenLDAP server.


Automount is in use both locally and remotely (Linux clients).   
Locally, /home/* is remounted from the zpool; remotely, /home and  
another filesystem (and children) are mounted using autofs.  There was  
some suspicion that automount is the problem, but no definitive  
evidence as of yet.


The problem has definitely been observed with stats (of some form,  
typically ‘/usr/bin/ls’ output) both remotely, locally in /home/*, and  
locally in /zpool/home/* (the true source location).  There is a clear  
correlation between recency of reads of the directories in question and  
recurrence of the fault, in that one user has scripted a regular (15m/ 
30m/hourly tests so far) ‘ls’ of the filesystems of interest, and  
this has reduced the fault to minimal noted impact for that user since  
starting down this path.


I have removed the l2arc(s) (cache devices) from the pool and the same  
behaviour has been observed.  My suspicion here was that there was  
perhaps occasional high synchronous load causing heavy writes to the  
slog devices, and when a stat was requested it may have been faulting  
from ARC to L2ARC prior to going to the primary data store.  The  
slowness has still been reported since removing the extra cache devices.


Another thought I had was along the lines of filesystem caching and  
heavy writes causing read blocking.  I have no evidence that this is  
the case, but there have been some suggestions on the list recently about  
limiting ZFS memory usage for write caching.  Can anybody comment on the  
effectiveness of this?  (I have 256MB of write cache in front of the slog  
SSDs and 512MB in front of the primary storage devices.)


My DTrace is very poor, but I suspect that it is the best way to  
root-cause this problem.  If somebody has any code that may assist in  
debugging this problem and is able to share it, that would be much appreciated.
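
As a very rough sketch of the kind of thing I had in mind (assuming the fbt 
probes for zfs_readdir exist on this build - the probe names are a guess, not 
verified):

  dtrace -n 'fbt::zfs_readdir:entry { self->ts = timestamp; }
             fbt::zfs_readdir:return /self->ts/ { @lat["readdir latency (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'

which should at least show whether the slow ‘ls’ time is being spent inside 
ZFS or elsewhere.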


Any other suggestions for how to identify this fault and work around  
it would be greatly appreciated.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever


On 07/09/2009, at 6:24 AM, Richard Elling wrote:


On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:


On Sun, Sep 6, 2009 at 9:15 AM, James Leverj...@jamver.id.au wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris  
b118

system typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups).  There is a delay of between 2 and 30  
seconds
but no correlation has been noticed with load on the server and  
the slow
return.  This problem has only been noticed via NFS (v3.  We are  
migrating
to NFSv4 once the O_EXCL/mtime bug fix has been integrated -  
anticipated for

snv_124).  The problem has been observed both locally on the primary
filesystem, in an locally automounted reference (/home/foo) and  
remotely via

NFS.


I'm confused.  If “this problem has only been noticed via NFS (v3)”, then
how is it observed locally?


Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.

It has been observed in client:/home/user (NFSv3 automount from  
server:/home/user, redirected to server:/zpool/home/user) and also in  
server:/home/user (local automount) and server:/zpool/home/user  
(origin).



iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.


What specifically should I be looking for here? (using ‘iostat -xen -T  
d’)  and I’m guessing I’ll require a high level of granularity (1s  
intervals) to see the issue if it is a single disk or similar.



stat(2) doesn't write, so you can stop worrying about the slog.


My concern here was that I may have been trying to write (via other  
concurrent processes) at the same time as there was a memory fault  
from the ARC to L2ARC.



Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.


No errors or collisions from either server or clients observed.


That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.


See rcapd(1m), rcapadm(1m), and rcapstat(1m) along with the
Physical Memory Control Using the Resource Capping  Daemon
in  System Administration Guide: Solaris Containers-Resource
Management, and Solaris Zones


Thanks Richard, I’ll have a look at that today and see where I get.

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever


On 07/09/2009, at 11:08 AM, Richard Elling wrote:


Ok, just so I am clear, when you mean local automount you are
on the server and using the loopback -- no NFS or network involved?


Correct.  And the behaviour has been seen locally as well as remotely.


You are looking for I/O that takes seconds to complete or is stuck in
the device.  This shows up in the actv column stuck > 1 and the
asvc_t > 1000.


Just started having some slow responsiveness reported from a user  
using emacs (autosave at the start of a build), so a small file write request.


In the second or so before they went to do this, it appears as if the  
RAID cache in front of the slog devices was nearly filled and the SSDs  
were being utilised quite heavily, but then there was a break where I  
am seeing relatively light usage on the slog yet 100% busy reported on  
the device.


The iostat output is at the end of this message - I can’t make any  
real sense out of why a user would have seen a ~4s delay at about  
2:39:17-18.  Only one of the two slog devices is being used at all.   
Is there some tunable governing how multiple slogs are used?


c7t[01] are rpool
c7t[23] are slog devices in the data pool
c11t* are the primary storage devices for the data pool

cheers,
James

Monday,  7 September 2009  2:39:17 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0 1475.0    0.0 188799.0  0.0 30.2    0.0   20.5   2  90   0   0   0   0 c7t2d0
    0.0  232.0    0.0  29571.8  0.0 33.8    0.0  145.9   0  98   0   0   0   0 c7t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0

Monday,  7 September 2009  2:39:18 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0    0.0    0.0      0.0  0.0 35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0

Monday,  7 September 2009  2:39:19 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  341.0    0.0  43650.1  0.0 35.0    0.0  102.5   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0  

Re: [zfs-discuss] periodic slow responsiveness

2009-09-06 Thread James Lever


On 07/09/2009, at 10:46 AM, Ross Walker wrote:

zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI  
1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/ 
E) with 2x SSDs each partitioned as 10GB slog and 36GB remainder as  
l2arc behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with  
PERC 6/i).


This config might lead to heavy sync writes (NFS) starving reads due  
to the fact that the whole RAIDZ2 behaves as a single disk on  
writes. How about 2 x 5-disk RAIDZ2s or 3 x 4-disk RAIDZs?


Just one or two other vdevs to spread the load can make the world of  
difference.


This was a management decision.  I wanted to go down the striped  
mirrored pair solution, but the amount of space lost was considered  
too great.  RAIDZ2 was considered the best value option for our  
environment.


The system is configured as an NFS (currently serving NFSv3), iSCSI  
(COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34)  
with authentication taking place from a remote openLDAP server.


There are a lot of services here, all off one pool? You might be  
trying to bite off more than the config can chew.


That’s not a lot of services, really.  We have 6 users doing builds on  
multiple platforms and using the storage as their home directory  
(Windows and Unix).


The issue is interactive responsiveness and whether there is a way to tune  
the system to give that while still having good performance for builds  
when they are run.


Try taking a particularly bad problem station and configuring it  
static for a bit to see if it is.


That has been considered also, but the issue has also been observed  
locally on the fileserver.


That doesn't make a lot of sense to me; the L2ARC is a secondary read  
cache, so if writes are starving reads then the L2ARC would only help  
here.


I was suggesting that slog writes were possibly starving reads from the  
l2arc as they were on the same device.  This appears not to have been  
the issue, as the problem has persisted even with the l2arc devices  
removed from the pool.


It just may be that the pool configuration just can't handle the  
write IOPS needed and reads are starving.


Possible, but hard to tell.  Have a look at the iostat results I’ve  
posted.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-09-01 Thread James Lever


On 02/09/2009, at 9:54 AM, Adam Leventhal wrote:

After investigating this problem a bit I'd suggest avoiding  
deploying RAID-Z
until this issue is resolved. I anticipate having it fixed in build  
124.


Thanks for the status update on this Adam.

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool

2009-08-28 Thread James Lever


On 28/08/2009, at 3:23 AM, Adam Leventhal wrote:

There appears to be a bug in the RAID-Z code that can generate  
spurious checksum errors. I'm looking into it now and hope to have  
it fixed in build 123 or 124. Apologies for the inconvenience.


Are the errors being generated likely to cause any significant problem  
running 121 with a RAID-Z volume or should users of RAID-Z* wait until  
this issue is resolved?


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send/receive and compression

2009-08-22 Thread James Lever
Is there a mechanism by which you can perform a zfs send | zfs receive  
and not have the data uncompressed and recompressed at the other end?


I have a gzip-9 compressed filesystem that I want to back up to a  
remote system and would prefer not to have to recompress everything  
again at such great computational expense.


If this doesn't exist, how would one go about creating an RFE for it?

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need tips on zfs pool setup..

2009-08-04 Thread James Lever


On 04/08/2009, at 9:42 PM, Joseph L. Casale wrote:


I noticed a huge improvement when I moved a virtualized pool
off a series of 7200 RPM SATA discs to even 10k SAS drives.
Night and day...


What I would really like to know is whether it makes a big difference  
comparing, say, 7200RPM drives in mirror+stripe mode vs 15kRPM drives in  
raidz2.


And how much of a difference raidz2 makes compared to mirror+stripe in a  
contentious multi-client environment.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-04 Thread James Lever


On 05/08/2009, at 10:36 AM, Carson Gaspar wrote:

Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support  
recently.


Yep, it's a mega raid device.

I have been using one with a Samsung SSD in RAID0 mode (to avail  
myself of the cache) recently with great success.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-04 Thread James Lever


On 05/08/2009, at 11:36 AM, Ross Walker wrote:


Which model?


PERC 6/E w/512MB BBWC.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-04 Thread James Lever


On 05/08/2009, at 11:41 AM, Ross Walker wrote:


What is your recipe for these?


There wasn't one! ;)

The drive I'm using is a Dell badged Samsung MCCOE50G5MPQ-0VAD3.

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] ZFS and deduplication

2009-08-03 Thread James Lever

Nathan Hudson-Crim,

On 04/08/2009, at 8:02 AM, Nathan Hudson-Crim wrote:

Andre, I've seen this before. What you have to do is ask James each  
question 3 times and on the third time he will tell the truth. ;)


I know this is probably meant to be seen as a joke, but it's clearly  
in very poor taste and extremely discourteous and rude to make public  
statements to the effect of:


“James McPherson is a liar and we should publicly berate him until he  
tells us what we want to hear, regardless of the real situation, of  
which I have no information other than what I want to believe.”


Really!  Please actually think before you post.

(another) James
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] feature proposal

2009-07-30 Thread James Lever

Hi Darren,

On 30/07/2009, at 6:33 PM, Darren J Moffat wrote:

That already works if you have the snapshot delegation as that  
user.  It even works over NFS and CIFS.


Can you give us an example of how to correctly get this working?

I've read through the manpage but have not managed to get the correct  
set of permissions for it to work as a normal user (so far).


I'm sure others here would be keen to see a correct recipe to allow  
user managed snapshots remotely via mkdir/rmdir.
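
For what it's worth, my untested guess at the sort of delegation required 
would be along the lines of (user and dataset names are only examples):

  pfexec zfs allow -u someuser snapshot,destroy,mount tank/home/someuser

with the snapshots then created and destroyed remotely by that user via mkdir 
and rmdir in the .zfs/snapshot directory.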


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] zfs issues?

2009-07-29 Thread James Lever


On 29/07/2009, at 12:00 AM, James Lever wrote:

CR 6865661 *HOT* Created, P1 opensolaris/triage-queue zfs scrub  
rpool causes zpool hang


This bug I logged has been marked as related to CR 6843235, which is  
fixed in snv_119.


cheers,
James
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] zfs issues?

2009-07-28 Thread James Lever

Thanks for that Brian.

I've logged a bug:

CR 6865661 *HOT* Created, P1 opensolaris/triage-queue zfs scrub rpool  
causes zpool hang




Just discovered after trying to create a further crash dump that it's  
failing and rebooting with the following error (just caught it prior  
to the reboot):




panic dump timeout



so I'm not sure how else to assist with debugging this issue.



cheers,

James



On 28/07/2009, at 9:08 PM, Brian Ruthven - Solaris Network Sustaining  
- Sun UK wrote:



Yes:

$<systemdump

should do the trick.

Make sure your dumpadm is set up beforehand to enable savecore, and  
that you have a dump device. In my case the output looks like this:


$ pfexec dumpadm
Dump content: kernel pages
 Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/opensolaris
Savecore enabled: yes


Then you should get a dump saved in /var/crash/hostname on next  
reboot.


Brian


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] zfs issues?

2009-07-27 Thread James Lever


On 28/07/2009, at 6:44 AM, dick hoogendijk wrote:


Are there any known issues with zfs in OpenSolaris B118?
I run my pools formatted like the original release 2009.06 (I want to
be able to go back to it ;-). I'm a bit scared after reading about
serious issues in B119 (will be skipped, I heard). But B118 is safe?


Well, actually, I have an issue with ZFS under b118 on osol.

Under b117, I attached a second disk to my root pool and confirmed  
everything worked fine.  Rebooted with the disks in reverse order to  
prove the grub install worked, and everything was fine.  Removed one of the  
spindles, did an upgrade to b118, rebooted and tested, then  
rebooted and added the removed volume; this was an explicit test of  
automated resilvering and it worked perfectly.  Did one or two  
explicit scrubs along the way and they were fine too.


So then I upgraded my zpool from version 14 to version 16, and now zpool  
scrub rpool hangs the ZFS subsystem.  The machine still runs, it's  
pingable etc., but anything that goes to disk (at least rpool) hangs  
indefinitely.  This happens whether I boot with the mirror intact or  
degraded with one spindle removed.


I had help trying to create a crash dump, but nothing we tried  
caused the system to panic.  0>eip;:c;:c and other weird magic I  
don't fully grok.


Has anybody else seen this weirdness?

cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] zfs issues?

2009-07-27 Thread James Lever


On 28/07/2009, at 9:22 AM, Robert Thurlow wrote:


I can't help with your ZFS issue, but to get a reasonable crash
dump in circumstances like these, you should be able to do
savecore -L on OpenSolaris.


That would be well and good if I could get a login - due to the rpool  
being unresponsive, that was not possible.


So the only recourse we had was via kmdb :/  Is there a way to  
explicitly invoke savecore via kmdb?


James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deduplication

2009-07-14 Thread James Lever


On 15/07/2009, at 1:51 PM, Jean Dion wrote:

Do we know if this web article will be discussed at the conference in  
Brisbane, Australia this week?


http://www.pcworld.com/article/168428/sun_tussles_with_deduplication_startup.html?tk=rss_news

I do not expect details, but at least Sun's position on this, instead of  
leaving people to rumours like those published in this article.


Any replay or materials from this conference?


There is a ustream feed that's live now at:

http://www.ustream.tv/channel/kernel-conference-australia

The conference is being recorded as well and will likely be re-encoded  
and uploaded somewhere down the track.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-07 Thread James Lever


On 07/07/2009, at 8:20 PM, James Andrewartha wrote:

Have you tried putting the slog on this controller, either as an SSD  
or
regular disk? It's supported by the mega_sas driver, x86 and amd64  
only.


What exactly are you suggesting here?  Configure one disk on this  
array as a dedicated ZIL?  Would that improve performance any over  
using all disks with an internal ZIL?


I have now done some tests with the PERC 6/E, both in RAID10 (all  
devices as RAID0 LUNs, ZFS mirror/striped config) and also as a hardware  
RAID5, both with an internal ZIL.


RAID10 (10 disks, 5 mirror vdevs)
create 2m14.448s
unlink  0m54.503s

RAID5 (9 disks, 1 hot spare)
create 1m58.819s
unlink 0m48.509s

Unfortunately, Linux on the same RAID5 array using XFS still seems  
significantly faster.


Linux RAID5 (9 disks, 1 hot spare), XFS
create 1m30.911s
unlink 0m38.953s

Is there a way to disable the write barrier in ZFS in the way you can  
with Linux filesystems (-o barrier=0)?  Would this make any difference?


After much consideration, the lack of barrier capability makes no  
difference to filesystem stability in the scenario where you have a  
battery backed write cache.
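
The closest thing I am aware of in ZFS is not a per-filesystem barrier option 
but the global zfs_nocacheflush tunable, which stops ZFS issuing cache-flush 
commands to the devices - only sane when everything sits behind battery-backed 
cache, and I have not yet verified whether it changes these numbers:

  echo "set zfs:zfs_nocacheflush = 1" >> /etc/system    (takes effect at the next boot)
  echo zfs_nocacheflush/W0t1 | mdb -kw                  (or flip it on a live system)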


Due to using identical hardware and configurations, I think this is a  
fair apples to apples test now.  I'm now wondering if XFS is just the  
faster filesystem... (not the most practical management solution, just  
speed).


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread James Lever


On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

It seems like you may have selected the wrong SSD product to use.  
There seems to be a huge variation in performance (and cost) with so- 
called enterprise SSDs.  SSDs with capacitor-backed write caches  
seem to be fastest.


Do you have any methods to correctly measure the performance of an  
SSD for the purpose of a slog and any information on others (other  
than anecdotal evidence)?


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] surprisingly poor performance

2009-07-05 Thread James Lever


On 05/07/2009, at 1:57 AM, Ross Walker wrote:


Barriers are by default disabled on ext3 mounts... Google it and
you'll see interesting threads on the LKML. Seems there was some
serious performance degradation in using them. A lot of decisions in
Linux are made in favor of performance over data consistency.


After doing a fair bit of reading about Linux and write barriers, I'm  
sure that it's an issue for traditional direct-attach storage and for  
non-battery-backed write caches in RAID cards when the cache is enabled.


Is it actually an issue if you have a hardware RAID controller with BBWC  
enabled and the cache disabled on the HDDs? (i.e. correctly configured  
for data safety)


Should a correctly behaving RAID card ignore barrier write  
requests, given that the data is already on stable storage?


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-05 Thread James Lever


On 06/07/2009, at 9:31 AM, Ross Walker wrote:

There are two types of SSD drives on the market, the fast write SLC  
(single level cell) and the slow write MLC (multi level cell). MLC  
is usually used in laptops as SLC drives over 16GB usually go for  
$1000+ which isn't cost effective in a laptop. MLC is good for read  
caching though and most use it for L2ARC.


I just ordered a bunch of 16GB Imation Pro 7500's (formerly Mtron)  
from CDW lately for $290 a pop. They are suppose to be fast  
sequential write SLC drives and so-so random write. We'll see.


That will be interesting to see.

The Samsung drives we have are 50GB (64GB) SLC and apparently 2nd  
generation.


For a slog, is random write even an issue?  Or is it just the  
mechanism used to measure the IOPS performance of a typical device?


AFAIUI, the ZIL is used as a ring buffer.  How does that work with an  
SSD?  All this pain really makes me think the only sane slog is one  
that is RAM based and has enough capacitance to either make itself  
permanent or move the data to something permanent before failing  
(FusionIO, DDRdrive, for example).

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever

Hej Henrik,

On 03/07/2009, at 8:57 PM, Henrik Johansen wrote:


Have you tried running this locally on your OpenSolaris box - just to
get an idea of what it could deliver in terms of speed ? Which NFS
version are you using ?


Most of the tests shown in my original message are local, except the  
explicitly NFS-based metadata test shown at the very end (100k 0b  
files).  The 100k/0b test is an atomic test locally due to caching  
semantics and a lack of 100k explicit SYNC requests, so the  
transactions are able to be bundled together and written in one block.


I've just been using NFSv3 so far for these tests as it is widely  
regarded as faster, even though less functional.


cheers,
James


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever

Hi Mertol,

On 03/07/2009, at 6:49 PM, Mertol Ozyoney wrote:

ZFS SSD usage behaviour heavly depends on access pattern and for  
asynch ops ZFS will not use SSD's.   I'd suggest you to disable  
SSD's , create a ram disk and use it as SLOG device to compare the  
performance. If performance doesnt change, it means that the  
measurement method have some flaws or you havent configured Slog  
correctly.


I did some tests with a ramdisk slog and the write IOPS seemed to  
run at about the 4k/s mark vs about 800/s when using the SSD as slog and  
200/s without a slog.
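
For anyone wanting to repeat this, a ramdisk slog can be attached roughly as 
follows (the size and pool name are only illustrative, and obviously a ramdisk 
slog is not safe for real data):

  pfexec ramdiskadm -a slogdisk 2g
  pfexec zpool add tank log /dev/ramdisk/slogdisk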


# osol b117 RAID10+ramdisk slog
#
bash-3.2# time tar xf zeroes.tar; rm -rf zeroes/; | tee /root/zeroes-test-scalzi-dell-ramdisk_slog.txt

# tar
real    1m32.343s
# rm
real    0m44.418s

# linux+XFS on Hardware RAID
bash-3.2# time tar xf zeroes.tar; time rm -rf zeroes/; | tee /root/zeroes-test-linux-lsimegaraid_bbwc.txt

# tar
real    2m27.791s
# rm
real    0m46.112s

Please note that SSDs are way slower than DRAM-based write caches.  
SSDs will show a performance increase when you create load from  
multiple clients at the same time, as ZFS will be flushing the dirty  
cache sequentially.  So I'd suggest running the test from a lot of  
clients simultaneously.


I'm sure that it will be a more performant system in general, however,  
it is this explicit set of tests that I need to maintain or improve  
performance on.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread James Lever


On 03/07/2009, at 10:37 PM, Victor Latushkin wrote:

Slog in ramdisk is analogous to no slog at all and a disabled zil  
(well, it may actually be a bit worse). If you say that your old  
system is 5 years old, the difference in the above numbers may be due to  
differences in CPU and memory speed, and so it suggests that your  
Linux NFS server appears to be working at memory speed, hence  
the question. Because if it does not honor sync semantics you are  
really comparing apples with oranges here.


The slog in ramdisk is in no way similar to disabling the ZIL.  This  
is an NFS test, so if I had disabled the ZIL, writes would have to go  
direct to disk (not ZIL) before returning, which would potentially be  
even slower than ZIL on zpool.


The appearance of the Linux NFS server performing at memory  
speed may just be the BBWC in the LSI MegaRAID SCSI card.  One of the  
developers here had explicitly performed tests to check these  
assumptions and found no evidence that the Linux/XFS sync  
implementation was lacking, even though there were previous issues  
with it in one kernel revision.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] surprisingly poor performance

2009-07-03 Thread James Lever


On 04/07/2009, at 10:42 AM, Ross Walker wrote:

XFS on LVM or EVMS volumes can't do barrier writes due to the lack  
of barrier support in LVM and EVMS, so it doesn't do a hard cache  
sync like it would on a raw disk partition, which makes the numbers  
higher; BUT with battery-backed write cache the risk is negligible,  
and the numbers are higher than those on file systems that do do a  
hard cache sync.


Do you have any references for this?  and perhaps some published  
numbers that you may have seen?


Try XFS on a raw partition and NFS with sync writes enabled and see  
how it performs then.


I cannot do this on the existing fileserver and do not have another  
system with a BBWC card to test against.  The BBWC on the LSI MegaRaid  
is certainly the key factor here, I would expect.


I can test this assumption on this new hardware next week when I do a  
number of other tests and compare linux/XFS and perhaps remove LVM  
(though, I don't see why you would remove LVM from the equation).


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] surprisingly poor performance

2009-07-03 Thread James Lever


On 04/07/2009, at 1:49 PM, Ross Walker wrote:


I ran some benchmarks back when verifying this, but didn't keep them
unfortunately.

You can google: XFS Barrier LVM OR EVMS and see the threads about  
this.


Interesting reading.  Testing seems to show that either it's not  
relevant or there is something interesting going on with ext3 as a  
separate case.



When you do send me a copy, try both on a straight partition and then on an
LVM volume, and always use NFS sync; but when exporting, use the
no_wdelay option if you don't already - that eliminates slowdowns with
NFS sync on Linux.



The numbers below seem to indicate that either there are no barrier  
issues here, or the BBWC in the RAID controller makes them more or  
less invisible, as the ext3 volume below is directly on the exposed  
LUN while the XFS partition is on top of LVM2.


It does, however, show that xfs is much faster for deletes.

cheers,
James

bash-3.2# cd /nfs/xfs_on_LVM
bash-3.2# ( date ; time tar xf zeroes-10k.tar ; date ; time rm -rf zeroes/ ; date ) 2>&1

Sat Jul  4 15:31:13 EST 2009

real    0m18.145s
user    0m0.055s
sys     0m0.500s
Sat Jul  4 15:31:31 EST 2009

real    0m4.585s
user    0m0.004s
sys     0m0.261s
Sat Jul  4 15:31:36 EST 2009

bash-3.2# cd /nfs/ext3
bash-3.2# ( date ; time tar xf zeroes-10k.tar ; date ; time rm -rf zeroes/ ; date )

Sat Jul  4 15:32:43 EST 2009

real    0m15.509s
user    0m0.048s
sys     0m0.508s
Sat Jul  4 15:32:59 EST 2009

real    0m37.793s
user    0m0.006s
sys     0m0.225s
Sat Jul  4 15:33:37 EST 2009

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SPARC SATA, please.

2009-06-25 Thread James Lever


On 25/06/2009, at 5:16 AM, Miles Nordin wrote:


and mpt is the 1068 driver, proprietary, works on x86 and SPARC.



then there is also itmpt, the third-party-downloadable closed-source
driver from LSI Logic, dunno much about it but someone here used it.


I'm confused.  Why do you say the mpt driver is proprietary and the  
LSI provided tool is closed source?


I thought they were both closed source and that the LSI chipset  
specifications were proprietary.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cutting up a SSD for read/log use...

2009-06-21 Thread James Lever

Hi Erik,

On 22/06/2009, at 1:15 PM, Erik Trimble wrote:

I just looked at pricing for the higher-end MLC devices, and it  
looks like I'm better off getting a single drive of 2X capacity than  
two with X capacity.


Leaving aside the issue that by using 2 drives I get  2 x 3.0Gbps  
SATA performance instead of 1 x 3.0Gbps,  are there problems with  
using two slices instead of whole-drives?  That is, one slice for  
Read and the other for ZIL?


The benefit you will get using 2 drives instead of 1 will be doubling  
your IOPS which will improve your overall performance, especially when  
using those drives as ZILs.


Are you planning on using these drives as primary data storage and ZIL  
for the same volumes or as primary storage for (say) your rpool and  
ZIL for a data pool on spinning metal?
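
If you do go the two-slice route, attaching the slices is straightforward - a 
sketch, with made-up device and pool names:

  pfexec zpool add tank log c1t0d0s0      (small slice as the ZIL)
  pfexec zpool add tank cache c1t0d0s1    (remainder as L2ARC)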


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how to do backup

2009-06-20 Thread James Lever


On 20/06/2009, at 9:55 PM, Charles Hedrick wrote:

I have a USB disk, to which I want to do a backup. I've used send |  
receive. It works fine until I try to reboot. At that point the  
system fails to come up because the backup copy is set to be mounted  
at the original location so the system tries to mount two different  
things the same place. I guess I can have the script set  
mountpoint=none, but I'd think there would be a better approach.


Would a zpool export $backup_pool do the trick?  (And consequently,  
you import the USB zpool before you start your backups?)
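
As a rough sketch of what I had in mind (pool and snapshot names are only 
examples, and this assumes the snapshot already exists):

  zpool import backup
  zfs send -R tank@weekly | zfs receive -Fd backup
  zpool export backup

With the pool exported between runs, the backup copy can never be mounted over 
the original paths at boot.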


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss