Re: [zfs-discuss] IO load questions

2012-07-25 Thread Matt Breitbach
Trey,
Thanks for the enlightening info.  I was really hoping this system could
deliver more NFS IOPS out of RAM, but based on your results I'm guessing
that's just not possible with my hardware.  By chance, have you tried any of
the software FCoE drivers for OI with your Intel X520 and gotten any results
there?  I'm currently attached to a Nexus 5010 with no storage licensing, so I
can't test FCoE right now - we're moving to the same switches you have (5548s)
next week to get some FCoE tests going.  I would love to see FCoE results, or
to hear from anyone running RoCE/IB setups that use RDMA.


-Original Message-
From: Palmer, Trey [mailto:trey.pal...@gtri.gatech.edu] 
Sent: Wednesday, July 25, 2012 8:22 PM
To: Richard Elling; Matt Breitbach
Cc: zfs-discuss@opensolaris.org
Subject: RE: [zfs-discuss] IO load questions

BTW, these SSDs are 480GB Talos 2s.


From: Palmer, Trey
Sent: Wednesday, July 25, 2012 9:20 PM
To: Richard Elling; Matt Breitbach
Cc: zfs-discuss@opensolaris.org
Subject: RE: [zfs-discuss] IO load questions

Matt,

I've been testing an all-SSD array with Filebench.  As Richard implied, I
think your results are about what you can expect for NFS.  My results on
faster hardware are not blowing yours away.

I've been testing 8K records, but I tried 4K a few times (with 4K recordsize
natch) without that much improvement.

I have found that the hardware (CPUs) makes a pretty big difference.

My test ZFS server is:

OI 151a5
HP Gen8, 2 x Xeon E5-2630, 384GB RAM
2 x LSI 9205-8e HBAs
Supermicro SC417 JBOD with 3 24x2.5" dual-port SAS backplanes
40 OCZ SSDs split between 2 SAS expanders, connected to separate SAS cards
Mirrored zpool: recordsize=8K, atime=off, sync=disabled, primarycache=metadata
Filebench: directio=1, 32-128 total threads

Server and clients are single-connected via Intel X520 to a Nexus 5548.
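For reference, the dataset tuning above boils down to a handful of zpool/zfs
property settings; this is a rough sketch only, with made-up pool and dataset
names (the real pool is the 40 SSDs arranged as mirror pairs):

# zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0   # ...and so on for the remaining pairs
# zfs create tank/fbtest
# zfs set recordsize=8k tank/fbtest
# zfs set atime=off tank/fbtest
# zfs set sync=disabled tank/fbtest           # benchmark-only: drops synchronous write semantics
# zfs set primarycache=metadata tank/fbtest   # cache metadata only, so data reads hit the SSDs rather than ARC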

I tested with three different NFS clients, all running OI 151a5 or Solaris
11.  Here are the best results I got for read-only and ~70/30 read/write:

Dual-5530:   53K read, 36/15K read/write
Sparc T4-1:   62K read,  40/18K read/write
Dual E5-2630:   86K read, 49/23K read/write

On the local server I get these results:

168K read
76K write
115/45K read/write
85/62K read/write

Just for my own edification I set primarycache=all and directio=0 and ran read
tests on local pools on all three machines.  This really shows the difference
made by the hardware.  Peak rates were:

T4-1   397K
Dual-E5  345K
Dual-5530  182K

Also, latencies go up as you go down the chart.  The T4-1 and dual-E5
reached peak results at 64/72 threads; the dual-5530 (G6) didn't scale above 24.

The E5 ZFS server can do uncached reads from the zfs pool almost as fast as
the dual-5530 can read from memory!!! (though latencies are much higher, 0.7
vs 0.1 ms).

The T4 is pretty impressive for even moderately threaded workloads; in this
test it kept up with the E5 at 8-12 threads and passed it handily at 24.
A giant leap over Niagara 2.
iperf shows the T4's network throughput to be slower than the E5's, which
likely explains it being slower for NFS but faster from memory.  But we
don't have the mezzanine cards, so it's using a likely-suboptimal X520.
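For anyone who wants to repeat the iperf comparison, a typical invocation is
roughly as follows (hostname and stream count are illustrative only):

server# iperf -s
client# iperf -c nfs-server -P 8 -t 30    # 8 parallel TCP streams for 30 seconds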

 -- Trey




From: zfs-discuss-boun...@opensolaris.org [zfs-discuss-boun...@opensolaris.org] on behalf of Richard Elling [richard.ell...@gmail.com]
Sent: Wednesday, July 25, 2012 11:05 AM
To: Matt Breitbach
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] IO load questions

On Jul 25, 2012, at 7:34 AM, Matt Breitbach wrote:

NFS - iSCSI and FC/FCoE to come once I get it into the proper lab.

ok, so NFS for these tests.

I'm not convinced a single ESXi box can drive the load to saturate 10GbE.

Also, depending on how you are configuring the system, the I/O that you
think is 4KB might look very different coming out of ESXi. Use nfssvrtop or
one of the many dtrace one-liners for observing NFS traffic to see what is
really on the wire. And I'm very interested to know if you see 16KB reads
during the "write-only" workload.

more below...


From: Richard Elling [mailto:richard.ell...@gmail.com]
Sent: Tuesday, July 24, 2012 11:36 PM
To: matth...@flash.shanje.com
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] IO load questions

Important question, what is the interconnect? iSCSI? FC? NFS?
 -- richard

On Jul 24, 2012, at 9:44 AM, matth...@flash.shanje.com wrote:


Working on a POC for high IO workloads, and I'm running into a bottleneck
that I'm not sure I can solve.  Testbed looks like this:

SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPUs, 72GB RAM, and ESXi VM -
4GB RAM, 1 vCPU
Connectivity: dual 10Gbit Ethernet to Cisco Nexus 5010

Target Nexenta system:

Intel barebones, dual Xeon 5620 CPUs, 192GB RAM, Nexenta 3.1.3 Enterprise
Intel X520 dual-port 10Gbit Ethernet - LACP Active VPC to Nexus 5010
switches.
2x LSI 9201-16E HBAs, 1x LSI 9200-8e HBA
5 DAEs (3 in use for this test)
1 DAE - connected (

Re: [zfs-discuss] online increase of zfs after LUN increase ?

2012-07-25 Thread Cindy Swearingen

Hi--

I guess I can't begin to understand patching.

Yes, you provided a whole disk to zpool create, but it actually
creates a part(ition) 0, as you can see in the output below.

Part      Tag    Flag     First Sector       Size    Last Sector
  0        usr    wm               256    19.99GB       41927902

Part      Tag    Flag     First Sector       Size    Last Sector
  0        usr    wm               256    99.99GB      209700062

I'm sorry you had to recreate the pool. This *is* a must-have feature
and it is working as designed in Solaris 11 and with patch 148098-3 (or
whatever the equivalent is) in Solaris 10 as well.

Maybe it's time for me to recheck this feature in current Solaris 10
bits.

Thanks,

Cindy



On 07/25/12 16:14, Habony, Zsolt wrote:

Thank you for your replies.

First, sorry for the misleading info.  Patch 148098-03 is indeed not included
in the recommended set, but trying to download it shows that 147440-15
obsoletes it and that 147440-19 is included in the latest recommended patch set.
Thus time has solved the problem by another route.

Just for fun, my case was:

A standard LUN is used as a ZFS filesystem, with no redundancy (the storage
array already provides that), and no partitioning; the disk is given directly to zpool.
# zpool status -oraarch
   pool: -oraarch
  state: ONLINE
  scan: none requested
config:

 NAME STATE READ WRITE CKSUM
 xx-oraarch   ONLINE   0 0 0
   c5t60060E800570B90070B96547d0  ONLINE   0 0 0

errors: No known data errors

Partitioning shows this.

partition>  pr
Current partition table (original):
Total disk sectors available: 41927902 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector       Size    Last Sector
  0        usr    wm               256    19.99GB       41927902
  1 unassigned    wm                 0          0              0
  2 unassigned    wm                 0          0              0
  3 unassigned    wm                 0          0              0
  4 unassigned    wm                 0          0              0
  5 unassigned    wm                 0          0              0
  6 unassigned    wm                 0          0              0
  8   reserved    wm          41927903     8.00MB       41944286


As I mentioned, I did not partition it; "zpool create" did.  I had absolutely no
idea how to resize these partitions, where to get the available number of
sectors, or how many should be skipped and reserved ...
Thus I backed up the 10G, destroyed the zpool, created a new zpool (the size
was fine now), and restored the data.

The partition table looks like this now; I do not think I could have created it
easily by hand.

partition>  pr
Current partition table (original):
Total disk sectors available: 209700062 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector       Size    Last Sector
  0        usr    wm               256    99.99GB      209700062
  1 unassigned    wm                 0          0              0
  2 unassigned    wm                 0          0              0
  3 unassigned    wm                 0          0              0
  4 unassigned    wm                 0          0              0
  5 unassigned    wm                 0          0              0
  6 unassigned    wm                 0          0              0
  8   reserved    wm         209700063     8.00MB      209716446

Thank you for your help.
Zsolt Habony




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] online increase of zfs after LUN increase ?

2012-07-25 Thread Habony, Zsolt
Thank you for your replies.

First, sorry for the misleading info.  Patch 148098-03 is indeed not included
in the recommended set, but trying to download it shows that 147440-15
obsoletes it and that 147440-19 is included in the latest recommended patch set.
Thus time has solved the problem by another route.

Just for fun, my case was:

A standard LUN is used as a ZFS filesystem, with no redundancy (the storage
array already provides that), and no partitioning; the disk is given directly to zpool.
# zpool status -oraarch
  pool: -oraarch
 state: ONLINE
 scan: none requested
config:

NAME STATE READ WRITE CKSUM
xx-oraarch   ONLINE   0 0 0
  c5t60060E800570B90070B96547d0  ONLINE   0 0 0

errors: No known data errors

Partitioning shows this.  

partition> pr
Current partition table (original):
Total disk sectors available: 41927902 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector       Size    Last Sector
  0        usr    wm               256    19.99GB       41927902
  1 unassigned    wm                 0          0              0
  2 unassigned    wm                 0          0              0
  3 unassigned    wm                 0          0              0
  4 unassigned    wm                 0          0              0
  5 unassigned    wm                 0          0              0
  6 unassigned    wm                 0          0              0
  8   reserved    wm          41927903     8.00MB       41944286


As I mentioned, I did not partition it; "zpool create" did.  I had absolutely no
idea how to resize these partitions, where to get the available number of
sectors, or how many should be skipped and reserved ...
Thus I backed up the 10G, destroyed the zpool, created a new zpool (the size
was fine now), and restored the data.

The partition table looks like this now; I do not think I could have created it
easily by hand.

partition> pr
Current partition table (original):
Total disk sectors available: 209700062 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector       Size    Last Sector
  0        usr    wm               256    99.99GB      209700062
  1 unassigned    wm                 0          0              0
  2 unassigned    wm                 0          0              0
  3 unassigned    wm                 0          0              0
  4 unassigned    wm                 0          0              0
  5 unassigned    wm                 0          0              0
  6 unassigned    wm                 0          0              0
  8   reserved    wm         209700063     8.00MB      209716446

Thank you for your help.
Zsolt Habony



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] online increase of zfs after LUN increase ?

2012-07-25 Thread Cindy Swearingen

Hi--

Patches are available to fix this, so I would suggest that you
request them from MOS support.

This fix fell through the cracks. We tried really hard to get it
into the current Solaris 10 release, but sometimes things don't
work in your favor. The patches are available, though.

Relabeling disks on a live pool is not a recommended practice,
so let's review other options, but first some questions:

1. Is this a redundant pool?

2. Do you have an additional LUN (equivalent size) that you
could use as a spare?

What you could do is replace the existing LUN with a larger
LUN, if one is available. Then reattach the original LUN and detach
the spare LUN, but this depends on your pool configuration.
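For a non-redundant pool, one common shape of that LUN swap is a temporary
mirror; the following is a sketch only (pool and device names are
placeholders, and it is not necessarily the exact sequence Cindy has in mind):

# zpool set autoexpand=on mypool
# zpool attach mypool c5tORIGINALd0 c5tLARGERd0   # mirror the data onto the larger spare LUN
# zpool status mypool                             # wait for the resilver to complete
# zpool detach mypool c5tORIGINALd0               # pool is now backed by the larger LUN and can grow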

If requesting the patches is not possible and you don't have
a spare LUN, then please contact me directly. I might be able
to walk you through a more manual process.

Thanks,

Cindy


On 07/25/12 09:49, Habony, Zsolt wrote:

Hello,
There is a feature of ZFS (autoexpand, or zpool online -e) by which it can
consume an increased LUN immediately and grow the zpool.
That would be a very useful (vital) feature in an enterprise environment.

Though when I tried to use it, it did not work.  The LUN was expanded and
visible in format, but the zpool did not grow.
I found a bug, SUNBUG:6430818 (Solaris Does Not Automatically Handle an
Increase in LUN Size).
Bad luck.

A patch exists, 148098, but it is _not_ part of the recommended patch set.
Thus my fresh install of Sol 10 U9 with the latest patch set still has the
problem.  (Strange that this problem is not considered high impact ...)

It mentions a workaround: zpool export, "Re-label the LUN using format(1m)
command.", zpool import.

Can you please help with that: what does re-labeling mean?
(As I need to request downtime for the zone now, I would like to prepare for
what I need to do.)

I have used the format utility thousands of times for organizing partitions,
though I have no idea how I would "relabel" a disk.
Also, I did not use format to label the disks; I gave the LUN to zpool
directly. I would not dare to touch or resize any partition with the format
utility without knowing what zpool wants to see there.

Have you experienced such a problem, and do you know how to grow a zpool after
a LUN increase?

Thank you in advance,
Zsolt Habony




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] online increase of zfs after LUN increase ?

2012-07-25 Thread Sašo Kiselkov
On 07/25/2012 05:49 PM, Habony, Zsolt wrote:
> Hello,
>   There is a feature of zfs (autoexpand, or zpool online -e ) that it can 
> consume the increased LUN immediately and increase the zpool size.
> That would be a very useful ( vital ) feature in enterprise environment.
> 
> Though when I tried to use it, it did not work.  LUN expanded and visible in 
> format, but zpool did not increase.
> I found a bug SUNBUG:6430818 (Solaris Does Not Automatically Handle an 
> Increase in LUN Size) 
> Bad luck. 
> 
> Patch exists: 148098 but _not_ part of recommended patch set.  Thus my fresh 
> install Sol 10 U9 with latest patch set still has the problem.  ( Strange 
> that this problem 
>  is not considered high impact ... )
> 
> It mentions a workaround: zpool export, "Re-label the LUN using
> format(1m) command.", zpool import.
> 
> Can you pls. help in that, what does that re-label mean ?  
> (As I need to ask downtime for the zone now ... , would like to prepare for 
> what I need to do )
> 
> I used format utility in thousands of times, for organizing partitions, 
> though I have no idea how I would "relabel" a disk.
> Also I did not use format to label the disks, I gave the LUN to zpool 
> directly, I would not dare to touch or resize any partition with format 
> utility, not knowing what zpool wants to see there.
> 
> Have you experienced such problem, and do you know how to increase zpool 
> after a LUN increase ?

"Relabel" means simply running the labeling command in the format
utility after you've made changes to the slices. As long as you keep the
start cluster of a slice the same and don't shrink it, nothing bad
should happen.
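Roughly this sequence, in other words (a sketch only -- the exact format
prompts depend on the label type, the pool name is a placeholder, and the
device name is the one from earlier in the thread):

# zpool export mypool
# format c5t60060E800570B90070B96547d0
format> partition
partition> modify        # grow slice 0 into the new space, keeping its starting sector
partition> label         # write the updated label to the disk
partition> quit
format> quit
# zpool import mypool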

Are you doing this on a root pool?

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] online increase of zfs after LUN increase ?

2012-07-25 Thread Habony, Zsolt
Hello,
There is a feature of ZFS (autoexpand, or zpool online -e) by which it can
consume an increased LUN immediately and grow the zpool.
That would be a very useful (vital) feature in an enterprise environment.
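In other words, once the LUN has been grown on the array, the intended usage
is roughly the following (pool and device names taken from the follow-up
elsewhere in this thread):

# zpool set autoexpand=on xx-oraarch       # grow automatically when the LUN grows, or do it once by hand:
# zpool online -e xx-oraarch c5t60060E800570B90070B96547d0
# zpool list xx-oraarch                    # SIZE should now reflect the larger LUN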

Though when I tried to use it, it did not work.  The LUN was expanded and
visible in format, but the zpool did not grow.
I found a bug, SUNBUG:6430818 (Solaris Does Not Automatically Handle an
Increase in LUN Size).
Bad luck.

A patch exists, 148098, but it is _not_ part of the recommended patch set.
Thus my fresh install of Sol 10 U9 with the latest patch set still has the
problem.  (Strange that this problem is not considered high impact ...)

It mentions a workaround: zpool export, "Re-label the LUN using format(1m)
command.", zpool import.

Can you please help with that: what does re-labeling mean?
(As I need to request downtime for the zone now, I would like to prepare for
what I need to do.)

I have used the format utility thousands of times for organizing partitions,
though I have no idea how I would "relabel" a disk.
Also, I did not use format to label the disks; I gave the LUN to zpool
directly. I would not dare to touch or resize any partition with the format
utility without knowing what zpool wants to see there.

Have you experienced such a problem, and do you know how to grow a zpool after
a LUN increase?

Thank you in advance,
Zsolt Habony




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IO load questions

2012-07-25 Thread Richard Elling
On Jul 25, 2012, at 7:34 AM, Matt Breitbach wrote:

> NFS – iSCSI and FC/FCoE to come once I get it into the proper lab.

ok, so NFS for these tests.

I'm not convinced a single ESXi box can drive the load to saturate 10GbE.

Also, depending on how you are configuring the system, the I/O that you 
think is 4KB might look very different coming out of ESXi. Use nfssvrtop
or one of the many dtrace one-liners for observing NFS traffic to see what is
really on the wire. And I'm very interested to know if you see 16KB reads
during the "write-only" workload.

more below...


> From: Richard Elling [mailto:richard.ell...@gmail.com] 
> Sent: Tuesday, July 24, 2012 11:36 PM
> To: matth...@flash.shanje.com
> Cc: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] IO load questions
>  
> Important question, what is the interconnect? iSCSI? FC? NFS?
>  -- richard
>  
> On Jul 24, 2012, at 9:44 AM, matth...@flash.shanje.com wrote:
> 
> 
> Working on a POC for high IO workloads, and I’m running in to a bottleneck 
> that I’m not sure I can solve.  Testbed looks like this :
> 
> SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPU’s, 72GB RAM, and ESXi
> VM – 4GB RAM, 1vCPU
> Connectivity dual 10Gbit Ethernet to Cisco Nexus 5010
> 
> Target Nexenta system :
> 
> Intel barebones, Dual Xeon 5620 CPU’s, 192GB RAM, Nexenta 3.1.3 Enterprise
> Intel x520 dual port 10Gbit Ethernet – LACP Active VPC to Nexus 5010 switches.
> 2x LSI 9201-16E HBA’s, 1x LSI 9200-8e HBA
> 5 DAE’s (3 in use for this test)
> 1 DAE – connected (multipathed) to LSI 9200-8e.  Loaded w/ 6x Stec ZeusRAM 
> SSD’s – striped for ZIL, and 6x OCZ Talos C 230GB drives for L2ARC.
> 2 DAE’s connected (multipathed) to one LSI 9201-16E – 24x 600GB 15k Seagate 
> Cheetah drives
> Obviously data integrity is not guaranteed
> 
> Testing using IOMeter from windows guest, 10GB test file, queue depth of 64
> I have a share set up with 4k recordsizes, compression disabled, access time 
> disabled, and am seeing performance as follows :
> 
> ~50,000 IOPS 4k random read.  200MB/sec, 30% CPU utilization on Nexenta, ~90% 
> utilization on guest OS.  I’m guessing guest OS is bottlenecking.  Going to 
> try physical hardware next week
> ~25,000 IOPS 4k random write.  100MB/sec, ~70% CPU utilization on Nexenta, 
> ~45% CPU utilization on guest OS.  Feels like Nexenta CPU is bottleneck. Load 
> average of 2.5

For cases where you are not bandwidth limited, larger recordsizes can be more 
efficient. There
is no good rule-of-thumb for this, and larger recordsizes will, at some point, 
hit the bandwidth
bottlenecks. I've had good luck with 8KB and 32KB recordsize for ESXi+Windows 
over NFS.
I've never bothered to test 16KB, due to lack of time.
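For example (the dataset name is a placeholder; note that recordsize only
affects blocks written after the change):

# zfs set recordsize=32k tank/esx_nfs
# zfs get recordsize,compression,atime tank/esx_nfs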

> A quick test with 128k recordsizes and 128k IO looked to be 400MB/sec 
> performance, can’t remember CPU utilization on either side. Will retest and 
> report those numbers.

It would not surprise me to see a CPU bottleneck on the ESXi side at these 
levels.
 -- richard

> 
> It feels like something is adding more overhead here than I would expect on 
> the 4k recordsizes/IO workloads.  Any thoughts where I should start on this?  
> I’d really like to see closer to 10Gbit performance here, but it seems like 
> the hardware isn’t able to cope with it?
>  
> Theoretical peak performance for a single 10GbE wire is near 300k IOPS @ 4KB, 
> unidirectional.
> This workload is extraordinarily difficult to achieve with a single client 
> using any of the popular
> storage protocols.
>  -- richard
>  
> --
> ZFS Performance and Training
> richard.ell...@richardelling.com
> +1-760-896-4422

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IO load questions

2012-07-25 Thread Matt Breitbach
NFS - iSCSI and FC/FCoE to come once I get it into the proper lab.

 

From: Richard Elling [mailto:richard.ell...@gmail.com] 
Sent: Tuesday, July 24, 2012 11:36 PM
To: matth...@flash.shanje.com
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] IO load questions

 

Important question, what is the interconnect? iSCSI? FC? NFS?

 -- richard

 

On Jul 24, 2012, at 9:44 AM, matth...@flash.shanje.com wrote:





Working on a POC for high IO workloads, and I'm running into a bottleneck
that I'm not sure I can solve.  Testbed looks like this:

SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPUs, 72GB RAM, and ESXi
VM - 4GB RAM, 1 vCPU
Connectivity: dual 10Gbit Ethernet to Cisco Nexus 5010

Target Nexenta system:

Intel barebones, dual Xeon 5620 CPUs, 192GB RAM, Nexenta 3.1.3 Enterprise
Intel X520 dual-port 10Gbit Ethernet - LACP Active VPC to Nexus 5010
switches.
2x LSI 9201-16E HBAs, 1x LSI 9200-8e HBA
5 DAEs (3 in use for this test)
1 DAE - connected (multipathed) to the LSI 9200-8e.  Loaded w/ 6x Stec ZeusRAM
SSDs - striped for ZIL - and 6x OCZ Talos C 230GB drives for L2ARC.
2 DAEs connected (multipathed) to one LSI 9201-16E - 24x 600GB 15k Seagate
Cheetah drives
Obviously data integrity is not guaranteed

Testing using IOMeter from a Windows guest, 10GB test file, queue depth of 64.
I have a share set up with a 4k recordsize, compression disabled, access time
disabled, and am seeing performance as follows:

~50,000 IOPS 4k random read.  200MB/sec, 30% CPU utilization on Nexenta,
~90% utilization on the guest OS.  I'm guessing the guest OS is bottlenecking.
Going to try physical hardware next week.
~25,000 IOPS 4k random write.  100MB/sec, ~70% CPU utilization on Nexenta,
~45% CPU utilization on the guest OS.  Feels like the Nexenta CPU is the
bottleneck.  Load average of 2.5.

A quick test with a 128k recordsize and 128k IO looked to be about 400MB/sec;
can't remember CPU utilization on either side.  Will retest and report those
numbers.

It feels like something is adding more overhead here than I would expect on
the 4k recordsize/IO workloads.  Any thoughts on where I should start?
I'd really like to see closer to 10Gbit performance here, but it seems like
the hardware isn't able to cope with it?

 

Theoretical peak performance for a single 10GbE wire is near 300k IOPS @ 4KB,
unidirectional.

This workload is extraordinarily difficult to achieve with a single client
using any of the popular storage protocols.

 -- richard

 

--

ZFS Performance and Training

richard.ell...@richardelling.com

+1-760-896-4422

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss