Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-13 Thread Richard Elling
On Jun 8, 2010, at 12:46 PM, Miles Nordin wrote:

 re == Richard Elling richard.ell...@gmail.com writes:
 
re Please don't confuse Ethernet with IP.
 
 okay, but I'm not.  seriously, if you'll look into it.

[fine whine elided]
I think we can agree that the perfect network has yet to be invented :-)
Meanwhile, 6Gbps SAS switches are starting to hit the market... what fun :-)

re The latest OpenSolaris release is 2009.06 which treats all
re Zvol-backed COMSTAR iSCSI writes as sync. This was changed in
re the developer releases in summer 2009, b114.  For a release
re such as NexentaStor 3.0.2, which is based on b140 (+/-), the
re initiator's write cache enable/disable request is respected,
re by default.
 
 that helps a little, but it's far from a full enough picture to be
 useful to anyone IMHO.  In fact it's pretty close to ``it varies and
 is confusing'' which I already knew:
 
 * how do I control the write cache from the initiator?  though I
   think I already know the answer: ``it depends on which initiator,''
   and ``oh, you're using that one?  well i don't know how to do it
   with THAT initiator'' == YOU DON'T

For ZFS over a Solaris initiator, it is done by setting DKIOCSETWCE
via an ioctl.  Look on or near
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#276

I presume that this can also be set with format -e, as is done for other 
devices.  Has anyone else tried?
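
From memory (untested just now), the expert-mode menu path looks roughly
like this on a SCSI/SAS disk:

  # format -e
  ... select the disk ...
  format> cache
  cache> write_cache
  write_cache> display
  Write Cache is enabled
  write_cache> disable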

 
 * when the setting has been controlled, how long does it persist?
   Where can it be inspected?

RTFM stmfadm(1m) and look for wcd (write cache disable)
<small_rant>
drives me nuts that some people prefer negatives (disables) over
positives (enables)
</small_rant>
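
For example, something along these lines should show and flip the
setting for a given LU (the GUID is made up; syntax from memory):

  # stmfadm list-lu -v 600144f0c0ffee00   (look for the write cache line)
  # stmfadm modify-lu -p wcd=false 600144f0c0ffee00

AIUI the property lands in the persistent stmf configuration, so it
survives reboot.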

 
 * ``by default'' == there is a way to make it not respect the
   initiator's setting, and through a target shell command cause it to
   use one setting or the other, persistently?

See above.

 * is the behavior different for file-backed LUN's than zvol's?

Yes, it can be.  It can also be modified by the sync property.
See CR 6794730, need zvol support for DKIOCSETWCE and friends
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6794730
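
On builds that have the sync property, it is quick to inspect and
override per dataset or zvol (names invented):

  # zfs get sync tank/vol01
  # zfs set sync=always tank/vol01     (honor every write as sync)
  # zfs set sync=disabled tank/vol01   (ignore sync requests -- dangerous)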

 I guess there is less point to figuring this out until the behavior is
 settled.

I think it is settled, but perhaps not well documented :-(
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/








Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-11 Thread Pasi Kärkkäinen
On Tue, Jun 08, 2010 at 08:33:40PM -0500, Bob Friesenhahn wrote:
 On Tue, 8 Jun 2010, Miles Nordin wrote:

 re == Richard Elling richard.ell...@gmail.com writes:

re Please don't confuse Ethernet with IP.

 okay, but I'm not.  seriously, if you'll look into it.

 Did you misread where I said FC can exert back-pressure?  I was
 contrasting with Ethernet.

 You're really confused, though I'm sure you're going to deny it.

 I don't think so.  I think that it is time to reset and reboot yourself 
 on the technology curve.  FC semantics have been ported onto ethernet.  
 This is not your grandmother's ethernet but it is capable of supporting 
 both FCoE and normal IP traffic.  The FCoE gets per-stream QOS similar to 
 what you are used to from Fibre Channel. Quite naturally, you get to pay 
 a lot more for the new equipment and you have the opportunity to discard 
 the equipment you bought already.


Yeah, today enterprise iSCSI vendors like Equallogic (bought by Dell)
_recommend_ using flow control. Their iSCSI storage arrays are designed
to work properly with flow control and perform well.

Of course you need proper (certified) switches as well.

Equallogic says the delays from flow control pause frames are shorter
than TCP retransmits, so that's why they're using and recommending it.
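
On a Solaris host, if the NIC driver exposes it, the link-level flow
control setting can be checked and changed with dladm, roughly (link
name invented):

  # dladm show-linkprop -p flowctrl e1000g0
  # dladm set-linkprop -p flowctrl=bi e1000g0   (no | tx | rx | bi)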

-- Pasi



Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-11 Thread Miles Nordin
 pk == Pasi Kärkkäinen pa...@iki.fi writes:

 You're really confused, though I'm sure you're going to deny
 it.

  I don't think so.  I think that it is time to reset and reboot
 yourself on the technology curve.  FC semantics have been
 ported onto ethernet.  This is not your grandmother's ethernet
 but it is capable of supporting both FCoE and normal IP
 traffic.  The FCoE gets per-stream QOS similar to what you are
 used to from Fibre Channel.

FCoE != iSCSI.

FCoE was not being discussed in the part you're trying to contradict.
If you read my entire post, I talk about FCoE at the end and say more
or less ``I am talking about FCoE here only so you don't try to throw
out my entire post by latching onto some corner case not applying to
the OP by dragging FCoE into the mix'' which is exactly what you did.
I'm guessing you fired off a reply without reading the whole thing?

pk Yeah, today enterprise iSCSI vendors like Equallogic (bought
pk by Dell) _recommend_ using flow control. Their iSCSI storage
pk arrays are designed to work properly with flow control and
pk perform well.

pk Of course you need a proper (certified) switches aswell.

pk Equallogic says the delays from flow control pause frames are
pk shorter than tcp retransmits, so that's why they're using and
pk recommending it.

please have a look at the three links I posted about flow control not
being used the way you think it is by any serious switch vendor, and
the explanation of why this limitation is fundamental, not something
that can be overcome by ``technology curve.''  It will not hurt
anything to allow autonegotiation of flow control on non-broken
switches so I'm not surprised they recommend it with ``certified''
known-non-broken switches, but it also will not help unless your
switches have input/backplane congestion which they usually don't, or
your end host is able to generate PAUSE frames for PCIe congestion
which is maybe more plausible.  In particular it won't help with the
typical case of the ``incast'' problem in the experiment in the FAST
incast paper URL I gave, because they narrowed down what was happening
in their experiment to OUTPUT queue congestion, which (***MODULO
FCoE*** mr ``reboot yourself on the technology curve'') never invokes
ethernet flow control.

HTH.

ok let me try again:

yes, I agree it would not be stupid to run iSCSI+TCP over a CoS with
blocking storage-friendly buffer semantics if your FCoE/CEE switches
can manage that, but I would like to hear of someone actually DOING it
before we drag it into the discussion.  I don't think that's happening
in the wild so far, and it's definitely not the application for which
these products have been flogged.

I know people run iSCSI over IB (possibly with RDMA for moving the
bulk data rather than TCP), and I know people run SCSI over FC, and of
course SCSI (not iSCSI) over FCoE.  Remember the original assertion
was: please try FC as well as iSCSI if you can afford it.

Are you guys really saying you believe people are running ***iSCSI***
over the separate HOL-blocking hop-by-hop pause frame CoS's of FCoE
meshes?  or are you just spewing a bunch of noxious white paper
vapours at me?  because AIUI people using the
lossless/small-output-buffer channel of FCoE are running the FC
protocol over that ``virtual channel'' of the mesh, not iSCSI, are
they not?




Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-11 Thread Pasi Kärkkäinen
On Fri, Jun 11, 2010 at 03:30:26PM -0400, Miles Nordin wrote:
 [full quote of Miles Nordin's message above elided]

I was talking about iSCSI over TCP over IP over Ethernet. No FCoE. No IB.

-- Pasi



Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-11 Thread Bob Friesenhahn

On Fri, 11 Jun 2010, Miles Nordin wrote:


FCoE != iSCSI.

FCoE was not being discussed in the part you're trying to contradict.
If you read my entire post, I talk about FCoE at the end and say more
or less ``I am talking about FCoE here only so you don't try to throw
out my entire post by latching onto some corner case not applying to
the OP by dragging FCoE into the mix'' which is exactly what you did.
I'm guessing you fired off a reply without reading the whole thing?


I am deeply concerned that you are relying on your extensive 
experience with legacy ethernet technologies and have not done any 
research on modern technologies.


Entering FCoE into Google returns many useful hits describing 
technologies that are Ethernet, but more advanced than the Ethernet 
you generalized about in your lengthy text.


For example

http://www.fcoe.com/
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-462176.html
http://www.brocade.com/products-solutions/solutions/connectivity/FCoE/index.page
http://www.emulex.com/products/converged-network-adapters.html

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-08 Thread Miles Nordin
 re == Richard Elling richard.ell...@gmail.com writes:

re Please don't confuse Ethernet with IP.

okay, but I'm not.  seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

Ethernet output queues are either FIFO or RED, and are large compared
to FC and IB.  FC is buffer-credit, which HOL-blocks to prevent the
small buffers from overflowing, and IB is...blocking (almost no buffer
at all---about 2KB per port and bandwidth*delay product of about 1KB
for the whole mesh, compared to ARISTA which has about 48MB per port,
so except to the pedantic, IB is bufferless, i.e. it does not even buffer one
full frame).  Unlike Ethernet, both are lossless fabrics (sounds good)
and have an HOL-blocking character (sounds bad).  They're
fundamentally different at L2, so this is not about IP.  If you run IP
over IB, it is still blocking and lossless.  It does not magically
start buffering when you use IP because the fabric is simply unable to
buffer---there is no RAM in the mesh anywhere.  Both L2 and L3
switches have output queues, and both L3 and L2 output queues can be
FIFO or RED because the output buffer exists in the same piece of
silicon of an L3 switch no matter whether it's set to forward in L2 or
L3 mode, so L2 and L3 switches are like each other and unlike FC & IB.
This is not about IP.  It's about Ethernet.

a relevant congestion difference between L3 and L2 switches (confusing
ethernet with IP) might be ECN, because only an L3 switch can do ECN.
But I don't think anyone actually uses ECN.  It's disabled by default
in Solaris and, I think, all other Unixes.  AFAICT my Extreme
switches, a very old L3 flow-forwarding platform, are not able to flip
the bit.  I think 6500 can, but I'm not certain.
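
(For the curious, on Solaris the knob is the tcp_ecn_permitted ndd
tunable, if memory serves -- 0/1/2 for never/passive/active:

  # ndd -get /dev/tcp tcp_ecn_permitted
  # ndd -set /dev/tcp tcp_ecn_permitted 2
)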

re no back-off other than that required for the link. Since
re GbE and higher speeds are all implemented as switched fabrics,
re the ability of the switch to manage contention is paramount.
re You can observe this on a Solaris system by looking at the NIC
re flow control kstats.

You're really confused, though I'm sure you're going to deny it.
Ethernet flow control mostly isn't used at all, and it is never used
to manage output queue congestion except in hardware that everyone
agrees is defective.  I almost feel like I've written all this stuff
already, even the part about ECN.

Ethernet flow control is never correctly used to signal output queue
congestion.  The ethernet signal for congestion is a dropped packet.
flow control / PAUSE frames are *not* part of some magic mesh-wide
mechanism by which switches ``manage'' congestion.  PAUSE are used,
when they're used at all, for oversubscribed backplanes: for
congestion on *input*, which in Ethernet is something you want to
avoid.  You want to switch ethernet frames to the output port, where they
may or may not encounter congestion, so that you don't hold up input
frames headed toward other output ports.  If you did hold them up,
you'd have something like HOL blocking.  IB takes a different
approach: you simply accept the HOL blocking, but tend to design a
mesh with little or no oversubscription unlike ethernet LAN's which
are heavily oversubscribed on their trunk ports.  so...the HOL
blocking happens, but not as much as it would with a typical Ethernet
topology, and it happens in a way that in practice probably increases
the performance of storage networks.

This is interesting for storage because when you try to shove a
128kByte write into an Ethernet fabric, part of it may get dropped in
an output queue somewhere along the way.  In IB, never will part of
the write get dropped, but sometimes you can't shove it into the
network---it just won't go, at L2.  With Ethernet you rely on TCP to
emulate this can't-shove-in condition, and it does not work perfectly
in that it can introduce huge jitter and link underuse (``incast'' problem:

 http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf

), and secondly leave many kilobytes in transit within the mesh or TCP
buffers, like tens of megabytes and milliseconds per hop, requiring
large TCP buffers on both ends to match the bandwidth*jitter and
frustrating storage QoS by queueing commands on the link instead of in
the storage device, but in exchange you get from Ethernet no HOL
blocking and the possibility of end-to-end network QoS.  It is a fair
tradeoff but arguably the wrong one for storage based on experience
with iSCSI sucking so far.

But the point is, looking at those ``flow control'' kstats will only
warn you if your switches are shit, and shit in one particular way
that even cheap switches rarely are.  The metric that's relevant is
how many packets are being dropped, and in what pattern (a big bucket
of them at once like FIFO, or a scattering like RED), and how TCP is
adapting to these drops.  For this you might look at TCP stats in
solaris, or at output queue drop and output queue size stats on
managed switches, or simply at the overall 

Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-08 Thread Bob Friesenhahn

On Tue, 8 Jun 2010, Miles Nordin wrote:


re == Richard Elling richard.ell...@gmail.com writes:


   re Please don't confuse Ethernet with IP.

okay, but I'm not.  seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

You're really confused, though I'm sure you're going to deny it.


I don't think so.  I think that it is time to reset and reboot 
yourself on the technology curve.  FC semantics have been ported onto 
ethernet.  This is not your grandmother's ethernet but it is capable 
of supporting both FCoE and normal IP traffic.  The FCoE gets 
per-stream QOS similar to what you are used to from Fibre Channel. 
Quite naturally, you get to pay a lot more for the new equipment and 
you have the opportunity to discard the equipment you bought already.


Richard is not out in the weeds although there are probably plenty of 
weeds growing at the ranch.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-08 Thread Erik Trimble

On 6/8/2010 6:33 PM, Bob Friesenhahn wrote:

On Tue, 8 Jun 2010, Miles Nordin wrote:


re == Richard Elling richard.ell...@gmail.com writes:


   re Please don't confuse Ethernet with IP.

okay, but I'm not.  seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

You're really confused, though I'm sure you're going to deny it.


I don't think so.  I think that it is time to reset and reboot 
yourself on the technology curve.  FC semantics have been ported onto 
ethernet.  This is not your grandmother's ethernet but it is capable 
of supporting both FCoE and normal IP traffic.  The FCoE gets 
per-stream QOS similar to what you are used to from Fibre Channel. 
Quite naturally, you get to pay a lot more for the new equipment and 
you have the opportunity to discard the equipment you bought already.


Richard is not out in the weeds although there are probably plenty of 
weeds growing at the ranch.


Bob


Well, you saying we might want to put certain folks out to pasture?

<wink>

That said, I had a good look at FCoE about a year ago, and, unlike ATAoE, 
which effectively ran over standard managed or smart switches, FCoE 
required specialized switch hardware that was non-trivially expensive.  
Even so, it did seem to be a mature protocol implementation, so it was 
a viable option once the hardware price came down (and we had wider, 
better software implementations).


Also, FCoE really doesn't seem to play well with regular IP on the same 
link, so you really should dedicate a link (not necessarily a switch) to 
FCoE, and pipe your IP traffic via another link. It is NOT iSCSI.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Erik Trimble

Comments in-line.


On 6/6/2010 9:16 PM, Ken wrote:

I'm looking at VMWare, ESXi 4, but I'll take any advice offered.

On Sun, Jun 6, 2010 at 19:40, Erik Trimble erik.trim...@oracle.com 
mailto:erik.trim...@oracle.com wrote:


On 6/6/2010 6:22 PM, Ken wrote:

Hi,

I'm looking to build a virtualized web hosting server environment
accessing files on a hybrid storage SAN.  I was looking at using
the Sun X-Fire x4540 with the following configuration:

* 6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM
  SATA drives)
* 2 Intel X-25 32GB SSD's as a mirrored ZIL
* 4 Intel X-25 64GB SSD's as the L2ARC.
* De-duplification
* LZJB compression

The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:

* Should I use NFS with all five VM's accessing the exports,
  or one LUN for each VM, accessed over iSCSI?

Generally speaking, it depends on your comfort level with running iSCSI  
Volumes to put the VMs in, or serving everything out via NFS (hosting 
the VM disk file in an NFS filesystem).


If you go the iSCSI route, I would definitely go the one iSCSI volume 
per VM route - note that you can create multiple zvols per zpool on the 
X4540, so it's not limiting in any way to volume-ize a VM.  It's a lot 
simpler, easier, and allows for nicer management (snapshots/cloning/etc. 
on the X4540 side) if you go with a VM per iSCSI volume.


With NFS-hosted VM disks, do the same thing:  create a single filesystem 
on the X4540 for each VM.
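
As a rough sketch of the per-VM plumbing on the X4540 side (names, sizes 
and the GUID are invented; commands from memory, so double-check against 
the man pages):

  # svcadm enable -r svc:/network/iscsi/target:default
  # zfs create -V 20g tank/vm/vm01
  # sbdadm create-lu /dev/zvol/rdsk/tank/vm/vm01
  # stmfadm add-view 600144f0...      (GUID as printed by sbdadm)
  # itadm create-target

versus the NFS flavour:

  # zfs create tank/vmnfs/vm01
  # zfs set sharenfs=on tank/vmnfs/vm01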


Performance-wise, I'd have to test, but I /think/ the iSCSI route will 
be faster. Even with the ZIL SSDs.


In all cases, regardless of how you host the VM images themselves, I'd 
serve out the website files via NFS.  I'm not sure how ESXi works, but 
under something like Solaris/Vbox, I could set up the base Solaris 
system to run CacheFS for an NFS share, and then give local access to 
all the VBox instances that single NFS mountpoint.  That would allow for 
heavy client-side cacheing of important data for your web servers.  If 
you're careful, you can separate read-only data from write-only data, 
which would allow you even better performance tweaks.  I tend to like to 
have the host OS handle as much network traffic and cacheing of data as 
possible instead of each VM doing it; it tends to be more efficient that 
way.
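
If CacheFS is still present on your build, the client-side setup is 
roughly this (paths and hostname invented):

  # cfsadmin -c /var/cache/webfs
  # mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/webfs,ro \
      x4540:/export/web /web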




* Are the FSYNC speed issues with NFS resolved?

The ZIL SSDs will compensate for synchronous write issues in NFS.  Not 
completely eliminate them, but you shouldn't notice issues with sync 
writing until you're up at pretty heavy loads.



* Should I go with fiber channel, or will the 4 built-in 1Gbe
  NIC's give me enough speed?

Depending on how much RAM and how much local data caching you do (and 
the specifics of the web site accesses), 4 GBE should be fine.  However, 
if you want more, I'd get another quad GBE card, and then run at least 2 
guest instances per client hardware. Try very hard to have the 
equivalent of a full GBE available per VM.  Personally, I'd go for 
client hardware that has 4 GBE interfaces: (1) each for two VMs, 1 for 
external internet access, and 1 for management.  I'd then run the X4540 
with 8 GBE bonded (trunked/teamed/whatever) together.   This might be 
overkill, so see what your setup requires in terms of available bandwidth.
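
For the aggregation, something like the following (interface names 
invented; older builds want the -d device syntax instead of -l):

  # dladm create-aggr -l e1000g0 -l e1000g1 -l e1000g2 -l e1000g3 aggr1
  # ifconfig aggr1 plumb 192.168.10.10/24 up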



* How many SSD's should I use for the ZIL and L2ARC?

Being a website mux, your data pattern is likely to be 99% read with 
small random writes being the remaining 1%.  You need just enough 
high-performance SSD for the ZIL.  Honestly, the 32GB X25-E is larger 
than you'll likely ever need. I can't recommend anything else for the 
money, but the sad truth is that ZFS really only needs 1-2GB of NVRAM 
for the ZIL (for most use cases).  So get the smallest device you can 
find that still satisfies the high performance requirement.  Caveat: 
look at the archives for all the talk about protecting your ZIL device 
from power outages (and the lack of a capacitor in most modern SSDs).


For L2ARC, go big. Website files tend to be /very/ small, so you're in 
the worst use case for Dedup. With something like an X4540 and its huge 
data capacity, get as much L2ARC SSD space as you can afford. Remember:  
250 bytes per Dedup block. If you have 1k blocks for all those little 
files, well, your L2ARC needs to be 25% of your data size. *Ouch*   Now, 
you don't have to buy the super-expensive stuff for L2ARC: the good old 
Intel X-25M works just fine.  Don't mirror them.
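
To put rough numbers on that 250-bytes-per-block figure: 1TB of unique 
data at 1k blocks is about 10^9 blocks, i.e. roughly 250GB of dedup 
table (the 25% above), while the same 1TB at 128k blocks is only about 
8 million blocks, or around 2GB.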


Given the explosive potential size of your DDT, I'd think long and hard 
about which data you really want to Dedup. Disk is cheap, but SSD 
isn't.  Good news is that you can selectively decide which data sets to 
Dedup. Ain't ZFS great?
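
Selecting per dataset is just (dataset names invented):

  # zfs set dedup=on tank/vm/vm01
  # zfs set dedup=off tank/logs
  # zfs get -r dedup,compression tank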




* What pool structure should I use?

If it were me (and, given what little I know of your data), I'd go like 
this:


(1) pool for 

Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Jens Elkner
On Sun, Jun 06, 2010 at 09:16:56PM -0700, Ken wrote:
I'm looking at VMWare, ESXi 4, but I'll take any advice offered.
...
I'm looking to build a virtualized web hosting server environment accessing
files on a hybrid storage SAN.  I was looking at using the Sun X-Fire x4540
with the following configuration:

IMHO Solaris Zones with LOFS-mounted ZFS filesystems give you the highest
flexibility in all directions, probably the best performance and 
least resource consumption, fine-grained resource management (CPU,
memory, storage space), less maintenance stress, etc...
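
A minimal sketch of the lofs wiring, assuming a zone named web01 and a
dataset tank/sites/web01 (both invented) and a zone that is otherwise
already configured:

  # zfs create tank/sites/web01
  # zonecfg -z web01
  zonecfg:web01> add fs
  zonecfg:web01:fs> set dir=/sites
  zonecfg:web01:fs> set special=/tank/sites/web01
  zonecfg:web01:fs> set type=lofs
  zonecfg:web01:fs> end
  zonecfg:web01> commit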

Have fun,
jel.
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 12768


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Roy Sigurd Karlsbakk



Which Virtual Machine technology are you going to use? 

VirtualBox 
VMWare 
Xen 
Solaris Zones 
Somethinge else... 

It will make a difference as to my recommendation (or, do you want me to 
recommend a VM type, too?) 
This is somewhat off-topic for zfs-discuss, but still: after trying to fight a bug 
- http://www.virtualbox.org/ticket/6505 - for months and getting close-to-zero 
feedback from the VirtualBox developers, I have abandoned using vbox on 
OpenSolaris. It may work fine for a few days, perhaps even weeks, and then boom. I 
don't have the equipment to set up a test system, and my server is located some 50km 
from home, so I need something that works not part of the time, but all the 
time. Because of this, I'd recommend against VirtualBox on OpenSolaris. 

Vennlige hilsener / Best regards 

roy 
-- 
Roy Sigurd Karlsbakk 
(+47) 97542685 
r...@karlsbakk.net 
http://blogg.karlsbakk.net/ 
-- 
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms 
exist in Norwegian. 


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Miles Nordin
 et == Erik Trimble erik.trim...@oracle.com writes:

et With NFS-hosted VM disks, do the same thing: create a single
et filesystem on the X4540 for each VM.

previous posters pointed out there are unreasonable hard limits in
vmware to the number of NFS mounts or iSCSI connections or something,
so you will probably run into that snag when attempting to use the
much faster snapshotting/cloning in ZFS.

 * Are the FSYNC speed issues with NFS resolved?
 
et The ZIL SSDs will compensate for synchronous write issues in
et NFS.

okay, but sometimes for VM's I think this often doesn't matter because
NFSv3 and v4 only add fsync()'s on file closings, and a virtual disk
is one giant file that the client never closes.  There may still be
synchronous writes coming through if they don't get blocked in LVM2
inside the guest or blocked in the VM software, but whatever comes
through ought to be exactly the same number of them for NFS or iSCSI,
unless the vm software has different bugs in the nfs vs iscsi
back-ends.

the other difference is in the latest comstar which runs in
sync-everything mode by default, AIUI.  Or it does use that mode only
when zvol-backed?  Or something.  I've the impression it went through
many rounds of quiet changes, both in comstar and in zvol's, on its
way to its present form.  I've heard said here you can change the mode
both from the comstar host and on the remote initiator, but I don't
know how to do it or how sticky the change is, but if you didn't
change and stuck with the default sync-everything I think NFS would be
a lot faster.  This is if we are comparing one giant .vmdk or similar
on NFS, against one zvol.  If we are comparing an exploded filesystem
on NFS mounted through the virtual network adapter, then of course
you're right again Erik.

The tradeoff integrity tests are, (1) reboot the solaris storage host
without rebooting the vmware hosts  guests and see what happens, (2)
cord-yank the vmware host.  Both of these are probably more dangerous
than (3) command the vm software to virtual-cord-yank the guest.

 * Should I go with fiber channel, or will the 4 built-in 1Gbe
 NIC's give me enough speed?

FC has different QoS properties than Ethernet because of the buffer
credit mechanism---it can exert back-pressure all the way through the
fabric.  same with IB, which is HOL-blocking.  This is a big deal with
storage, with its large blocks of bursty writes that aren't really the
case for which TCP shines.  I would try both and compare, if you can
afford it!

je IMHO Solaris Zones with LOFS mounted ZFSs gives you the
je highest flexibility in all directions, probably the best
je performance and least resource consumption, fine grained
je resource management (CPU, memory, storage space) and less
je maintainance stress etc...

yeah zones are really awesome, especially combined with clones and
snapshots.  For once the clunky post-Unix XML crappo solaris
interfaces are actually something I appreciate a little, because lots
of their value comes from being able to do consistent repeatable
operations on them.

The problem is that the zones run Solaris instead of Linux.  BrandZ
never got far enough to, for example, run Apache under a
2.6-kernel-based distribution, so I don't find it useful for any real
work.  I do keep a CentOS 3.8 (I think?) brandz zone around, but not
for anything production---just so I can try it if I think the
new/weird version of a tool might be broken.

as for native zones, the ipkg repository, and even the jucr
repository, has two years old versions of everything---django/python,
gcc, movabletype.  Many things are missing outright, like nginx.  I'm
very disappointed that Solaris did not adopt an upstream package
system like Dragonfly did.  Gentoo or pkgsrc would have been very
smart, IMHO.  Even opencsw is based on Nick Moffitt's GAR system,
which was an old mostly-abandoned tool for building bleeding edge
Gnome on Linux.  The ancient perpetually-abandoned set of packages on
jucr and the crufty poorly-factored RPM-like spec files leave me with
little interest in contributing to jucr myself, while if Solaris had
poured the effort instead into one of these already-portable package
systems like they poured it into Mercurial after adopting that, then
I'd instead look into (a) contributing packages that I need most, and
(b) using whatever system Solaris picked on my non-Solaris systems.
This crap/marginalized build system means I need to look at a way to
host Linux under Solaris, using Solaris basically just for ZFS and
nothing else.  The alternative is to spend heaps of time re-inventing
the wheel only to end up with an environment less rich than
competitors and charge twice as much for it like joyent.

But, yeah, while working on Solaris I would never install anything in
the global zone after discovering how easy it is to work with ipkg
zones.  They are really brilliant, and unlike everyone else's attempt
at these 

Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Bob Friesenhahn

On Mon, 7 Jun 2010, Miles Nordin wrote:


FC has different QoS properties than Ethernet because of the buffer
credit mechanism---it can exert back-pressure all the way through the
fabric.  same with IB, which is HOL-blocking.  This is a big deal with
storage, with its large blocks of bursty writes that aren't really the
case for which TCP shines.  I would try both and compare, if you can
afford it!


FCoE is beginning to change this, with ethernet adaptors and switches 
which support the new features.  Without the new FCoE standards, 
Ethernet can exert back pressure but only on a local-link level, and 
with long delays.  You can be sure that companies like cisco will be 
(or are) selling FCoE hardware to compete with FC SANs.  The intention 
is that ethernet will put fibre channel out of business.  We shall see 
if history repeats itself.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Richard Elling
On Jun 7, 2010, at 11:06 AM, Miles Nordin wrote:
 
 the other difference is in the latest comstar which runs in
 sync-everything mode by default, AIUI.  Or it does use that mode only
 when zvol-backed?  Or something.  

It depends on your definition of latest.  The latest OpenSolaris release
is 2009.06 which treats all Zvol-backed COMSTAR iSCSI writes as
sync. This was changed in the developer releases in summer 2009, b114.
For a release such as NexentaStor 3.0.2, which is based on b140 (+/-),
the initiator's write cache enable/disable request is respected, by default.

 * Should I go with fiber channel, or will the 4 built-in 1Gbe
 NIC's give me enough speed?
 
 FC has different QoS properties than Ethernet because of the buffer
 credit mechanism---it can exert back-pressure all the way through the
 fabric.  same with IB, which is HOL-blocking.  This is a big deal with
 storage, with its large blocks of bursty writes that aren't really the
 case for which TCP shines. 

Please don't confuse Ethernet with IP. Ethernet has no routing and
no back-off other than that required for the link. Since GbE and higher
speeds are all implemented as switched fabrics, the ability of the switch
to manage contention is paramount.  You can observe this on a Solaris
system by looking at the NIC flow control kstats.
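For example (exact statistic names vary by NIC driver):

  # kstat -p | egrep -i 'pause|flowctrl|xon|xoff'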

For a LAN environment, there is little practical difference between 
Ethernet and FC wrt port contention -- high quality switches will prove
better than bargain-basement switches, with direct attach (no switches)
being the optimum cost+performance solution.  WANs are a different 
beast, and are where we find tuning the FC buffer credits to be worth the 
effort.  For WANs no tuning is required for IP on modern OSes (Ethernet
doesn't do WAN).

  I would try both and compare, if you can
 afford it!

+1
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/








Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Garrett D'Amore
On Mon, 2010-06-07 at 13:32 -0700, Richard Elling wrote:
 On Jun 7, 2010, at 11:06 AM, Miles Nordin wrote:
  
  the other difference is in the latest comstar which runs in
  sync-everything mode by default, AIUI.  Or it does use that mode only
  when zvol-backed?  Or something.  
 
 It depends on your definition of latest.  The latest OpenSolaris release
 is 2009.06 which treats all Zvol-backed COMSTAR iSCSI writes as
 sync. This was changed in the developer releases in summer 2009, b114.
 For a release such as NexentaStor 3.0.2, which is based on b140 (+/-),
 the initiator's write cache enable/disable request is respected, by default.
 

Minor correction: NexentaStor 3.0.2 is based on b134, plus a backport
of a number of selected patches from OpenSolaris -- especially ZFS
patches.

-- Garrett




Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Ross Walker
On Jun 7, 2010, at 2:10 AM, Erik Trimble erik.trim...@oracle.com  
wrote:



Comments in-line.


On 6/6/2010 9:16 PM, Ken wrote:


I'm looking at VMWare, ESXi 4, but I'll take any advice offered.

On Sun, Jun 6, 2010 at 19:40, Erik Trimble  
erik.trim...@oracle.com wrote:

On 6/6/2010 6:22 PM, Ken wrote:


Hi,

I'm looking to build a virtualized web hosting server environment  
accessing files on a hybrid storage SAN.  I was looking at using  
the Sun X-Fire x4540 with the following configuration:
6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA  
drives)

2 Intel X-25 32GB SSD's as a mirrored ZIL
4 Intel X-25 64GB SSD's as the L2ARC.
De-duplification
LZJB compression
The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:
Should I use NFS with all five VM's accessing the exports, or one  
LUN for each VM, accessed over iSCSI?


Generally speaking, it depends on your comfort level with running  
iSCSI  Volumes to put the VMs in, or serving everything out via NFS  
(hosting the VM disk file in an NFS filesystem).


If you go the iSCSI route, I would definitely go the one iSCSI  
volume per VM route - note that you can create multiple zvols per  
zpool on the X4540, so it's not limiting in any way to volume-ize a  
VM.  It's a lot simpler, easier, and allows for nicer management  
(snapshots/cloning/etc. on the X4540 side) if you go with a VM per  
iSCSI volume.


With NFS-hosted VM disks, do the same thing:  create a single  
filesystem on the X4540 for each VM.


VMware has a 32-mount limit, which may constrain the OP somewhat here.


Performance-wise, I'd have to test, but I /think/ the iSCSI route  
will be faster. Even with the ZIL SSDs.


Actually, properly tuned they are about the same, but VMware NFS  
datastores are FSYNC on all operations, which isn't the best for data  
vmdk files; it's best to serve that data directly to the VM using either  
iSCSI or NFS.







Are the FSYNC speed issues with NFS resolved?


The ZIL SSDs will compensate for synchronous write issues in NFS.   
Not completely eliminate them, but you shouldn't notice issues with  
sync writing until you're up at pretty heavy loads.


You will need this with VMware as every NFS operation (not just file  
open/close) coming out of VMware will be marked FSYNC (for VM data  
integrity in the face of server failure).











If it were me (and, given what little I know of your data), I'd go  
like this:


(1) pool for VMs:
8 disks, MIRRORED
1 SSD for L2ARC
one Zvol per VM instance, served via iSCSI, each with:
DD turned ON,  Compression turned OFF

(1) pool for clients to write data to (log files, incoming data, etc.)
6 or 8 disks, MIRRORED
2 SSDs for ZIL, mirrored
Ideally, As many filesystems as you have webSITES, not just  
client VMs.  As this might be unwieldy for 100s of websites, you  
should segregate them into obvious groupings, taking care with write/ 
read permissions.

NFS served
DD OFF, Compression ON  (or OFF, if you seem to be  
having CPU overload on the X4540)


(1) pool for client read-only data
All the rest of the disks, split into 7 or 8-disk RAIDZ2 vdevs
All the remaining SSDs for L2ARC
As many filesystems as you have webSITES, not just client  
VMs.  (however, see above)

NFS served
DD on for selected websites (filesystems),  
Compression ON for everything


(2) Global hot spares.


Make your life easy and use NFS for VMs and data. If you need high  
performance data such as databases, use iSCSI zvols directly into the  
VM, otherwise NFS/CIFS into the VM should be good enough.
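
For reference, a layout along the lines quoted above would be created 
roughly like this (pool and disk names are invented and vdev widths 
abbreviated -- adjust for the real X4540 controller/target map):

  # zpool create vmpool mirror c0t0d0 c1t0d0 mirror c2t0d0 c3t0d0 \
      mirror c4t0d0 c5t0d0 mirror c0t1d0 c1t1d0 cache c5t7d0
  # zpool create writepool mirror c2t1d0 c3t1d0 mirror c4t1d0 c5t1d0 \
      log mirror c3t7d0 c4t7d0
  # zpool create datapool \
      raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0 c0t3d0 \
      raidz2 c1t3d0 c2t3d0 c3t3d0 c4t3d0 c5t3d0 c0t4d0 c1t4d0 \
      cache c1t7d0 c2t7d0
  # zpool add datapool spare c2t4d0 c3t4d0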


-Ross



Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Ken
Everyone, thank you for the comments, you've given me lots of great info to
research further.

On Mon, Jun 7, 2010 at 15:57, Ross Walker rswwal...@gmail.com wrote:

 On Jun 7, 2010, at 2:10 AM, Erik Trimble erik.trim...@oracle.com wrote:

 Comments in-line.


 On 6/6/2010 9:16 PM, Ken wrote:

 I'm looking at VMWare, ESXi 4, but I'll take any advice offered.

 On Sun, Jun 6, 2010 at 19:40, Erik Trimble  erik.trim...@oracle.com
 erik.trim...@oracle.com wrote:

  On 6/6/2010 6:22 PM, Ken wrote:

 Hi,

  I'm looking to build a virtualized web hosting server environment
 accessing files on a hybrid storage SAN.  I was looking at using the Sun
 X-Fire x4540 with the following configuration:

- 6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA
drives)
- 2 Intel X-25 32GB SSD's as a mirrored ZIL
- 4 Intel X-25 64GB SSD's as the L2ARC.
- De-duplification
- LZJB compression

 The clients will be Apache web hosts serving hundreds of domains.

  I have the following questions:

- Should I use NFS with all five VM's accessing the exports, or one
LUN for each VM, accessed over iSCSI?

 Generally speaking, it depends on your comfort level with running
 iSCSI  Volumes to put the VMs in, or serving everything out via NFS (hosting
 the VM disk file in an NFS filesystem).

 If you go the iSCSI route, I would definitely go the one iSCSI volume per
 VM route - note that you can create multiple zvols per zpool on the X4540,
 so it's not limiting in any way to volume-ize a VM.  It's a lot simpler,
 easier, and allows for nicer management (snapshots/cloning/etc. on the X4540
 side) if you go with a VM per iSCSI volume.

 With NFS-hosted VM disks, do the same thing:  create a single filesystem on
 the X4540 for each VM.


 Vmware has a 32 mount limit which may limit the OP somewhat here.


 Performance-wise, I'd have to test, but I /think/ the iSCSI route will be
 faster. Even with the ZIL SSDs.


 Actually properly tuned they are about the same, but VMware NFS datastores
 are FSYNC on all operations which isn't the best for data vmdk files, best
 to serve the data directly to the VM using either iSCSI or NFS.


- Are the FSYNC speed issues with NFS resolved?

 The ZIL SSDs will compensate for synchronous write issues in NFS.
 Not completely eliminate them, but you shouldn't notice issues with sync
 writing until you're up at pretty heavy loads.


 You will need this with VMware as every NFS operation (not just file
 open/close) coming out of VMware will be marked FSYNC (for VM data integrity
 in the face of server failure).



 If it were me (and, given what little I know of your data), I'd go
 like this:

 (1) pool for VMs:
 8 disks, MIRRORED
 1 SSD for L2ARC
 one Zvol per VM instance, served via iSCSI, each with:
 DD turned ON,  Compression turned OFF

 (1) pool for clients to write data to (log files, incoming data, etc.)
 6 or 8 disks, MIRRORED
 2 SSDs for ZIL, mirrored
 Ideally, As many filesystems as you have webSITES, not just client
 VMs.  As this might be unwieldy for 100s of websites, you should segregate
 them into obvious groupings, taking care with write/read permissions.
 NFS served
 DD OFF, Compression ON  (or OFF, if you seem to be having
 CPU overload on the X4540)

 (1) pool for client read-only data
 All the rest of the disks, split into 7 or 8-disk RAIDZ2 vdevs
 All the remaining SSDs for L2ARC
 As many filesystems as you have webSITES, not just client VMs.
 (however, see above)
 NFS served
 DD on for selected websites (filesystems), Compression ON
 for everything

 (2) Global hot spares.


 Make your life easy and use NFS for VMs and data. If you need high
 performance data such as databases, use iSCSI zvols directly into the VM,
 otherwise NFS/CIFS into the VM should be good enough.

 -Ross




Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread David Magda

On Jun 7, 2010, at 16:32, Richard Elling wrote:

Please don't confuse Ethernet with IP. Ethernet has no routing and  
no back-off other than that required for the link.


Not entirely accurate going forward. IEEE 802.1Qau defines an end-to- 
end congestion notification management system:


http://blogs.netapp.com/ethernet/8021qau/

IEEE 802.1aq provides a link-state protocol for discovering the  
topology of an Ethernet network:


http://en.wikipedia.org/wiki/Shortest_Path_Bridging

See also the IETF's Transparent Interconnection of Lots of Links  
(TRILL):


http://tools.ietf.org/html/rfc5556
http://tools.ietf.org/wg/trill/

All of this is being done under the rubric of data center  
bridging (DCB):


http://en.wikipedia.org/wiki/Data_center_bridging

Brocade and IBM (?) call this Converged Enhanced Ethernet (CEE).

Things aren't what they used to was.



[zfs-discuss] Homegrown Hybrid Storage

2010-06-06 Thread Ken
Hi,

I'm looking to build a virtualized web hosting server environment accessing
files on a hybrid storage SAN.  I was looking at using the Sun X-Fire x4540
with the following configuration:

   - 6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA drives)
   - 2 Intel X-25 32GB SSD's as a mirrored ZIL
   - 4 Intel X-25 64GB SSD's as the L2ARC.
   - De-duplication
   - LZJB compression

The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:

   - Should I use NFS with all five VM's accessing the exports, or one LUN
   for each VM, accessed over iSCSI?
   - Are the FSYNC speed issues with NFS resolved?
   - Should I go with fiber channel, or will the 4 built-in 1Gbe NIC's give
   me enough speed?
   - How many SSD's should I use for the ZIL and L2ARC?
   - What pool structure should I use?

I know these questions are slightly vague, but any input would be greatly
appreciated.

Thanks!


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-06 Thread Erik Trimble

On 6/6/2010 6:22 PM, Ken wrote:

Hi,

I'm looking to build a virtualized web hosting server environment 
accessing files on a hybrid storage SAN.  I was looking at using the 
Sun X-Fire x4540 with the following configuration:


* 6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA
  drives)
* 2 Intel X-25 32GB SSD's as a mirrored ZIL
* 4 Intel X-25 64GB SSD's as the L2ARC.
* De-duplification
* LZJB compression

The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:

* Should I use NFS with all five VM's accessing the exports, or
  one LUN for each VM, accessed over iSCSI?
* Are the FSYNC speed issues with NFS resolved?
* Should I go with fiber channel, or will the 4 built-in 1Gbe
  NIC's give me enough speed?
* How many SSD's should I use for the ZIL and L2ARC?
* What pool structure should I use?

I know these questions are slightly vague, but any input would be 
greatly appreciated.


Thanks!



Which Virtual Machine technology are you going to use?

VirtualBox
VMWare
Xen
Solaris Zones
Somethinge else...

It will make a difference as to my recommendation (or, do you want me to 
recommend a VM type, too?)


<grin>



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
