Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-16 Thread Christopher George
 I mean, could I stripe across multiple devices to be able to handle higher
 throughput?

Absolutely.  Striping four DDRdrive X1s (a 16GB dedicated log) is 
extremely simple.  Each X1 has its own dedicated IOPS controller, which is 
critical for approaching linear synchronous write scalability.  The same 
principles and benefits of multi-core processing apply here with multiple 
controllers.  The performance potential of NVRAM-based SSDs dictates moving 
away from a single, shared HBA-based controller. 
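
(Editor's sketch, for illustration only - pool and device names are 
placeholders.  ZFS load-balances synchronous writes across all log devices 
in a pool, so adding several slogs is a one-liner:)

  # zpool add tank log c1t0d0 c2t0d0 c3t0d0 c4t0d0
  # zpool status tank      # the devices show up under the "logs" section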

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com


[zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Arve Paalsrud
Hi,

We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
Our hardware specifications are as follows:

Quad AMD G34 12-core 2.3 GHz (~110 GHz)
10 Crucial RealSSD (6Gb/s) 
42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
LSI2008SAS (two 4x ports)
Mellanox InfiniBand 40 Gbit NICs
128 GB RAM

This setup gives us about 40TB of storage after mirroring (with two disks as spares), 2.5TB 
of L2ARC and 64GB of ZIL, all fitting into a single 5U box.

Both L2ARC and ZIL share the same disks (striped) due to bandwidth 
requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k 
read/write scenario with a 70/30 distribution. Now, I know that you should have 
a mirrored ZIL for safety, but the entire box is synchronized with an active 
standby at a different site (18km away - round trip of 0.16ms plus 
equipment latency). So if the ZIL in Site A takes a fall, or the 
motherboard or disk group dies - we still have safety.

DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool 
is full (at capacity).

Without going into details about chipsets and such, do any of you on this list 
have experience with a similar setup, and can you share your thoughts, 
do's and don'ts, and any other information that could be of help while building 
and configuring this?

What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also 
InfiniBand-based), with both dedupe and compression enabled in ZFS.
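
(Editor's note, for reference: both features are per-dataset ZFS properties, so 
they can be enabled only where they pay off.  A minimal sketch, with pool and 
dataset names assumed:)

  # zfs set compression=on tank/esx
  # zfs set dedup=on tank/esx
  # zfs get compressratio tank/esx     # see what compression actually yields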

Let's talk moon landings.

Regards,
Arve


Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Erik Trimble

On 6/15/2010 4:42 AM, Arve Paalsrud wrote:

Hi,

We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
Our hardware specifications are as follows:

Quad AMD G34 12-core 2.3 GHz (~110 GHz)
10 Crucial RealSSD (6Gb/s)
42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
LSI2008SAS (two 4x ports)
Mellanox InfiniBand 40 Gbit NICs
128 GB RAM

This setup gives us about 40TB storage after mirror (two disks in spare), 2.5TB 
L2ARC and 64GB Zil, all fit into a single 5U box.

Both L2ARC and Zil shares the same disks (striped) due to bandwidth 
requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k 
read/write scenario with 70/30 distribution. Now, I know that you should have 
mirrored Zil for safety, but the entire box are synchronized with an active 
standby on a different site location (18km distance - round trip of 0.16ms + 
equipment latency). So in case the Zil in Site A takes a fall, or the 
motherboard/disk group/motherboard dies - we still have safety.

DDT requirements for dedupe on 16k blocks should be about 640GB when main pool 
are full (capacity).

Without going into details about chipsets and such, do any of you on this list 
have any experience with a similar setup and can share with us your thoughts, 
do's and dont's, and any other information that could be of help while building 
and configuring this?

What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also 
InfiniBand-based), with both dedupe and compression enabled in ZFS.

Let's talk moon landings.

Regards,
Arve
   



Given that for the ZIL random write IOPS is paramount, the RealSSD isn't a 
good choice.  SLC SSDs still spank any MLC device, and random IOPS for 
something like an Intel X25-E or OCZ Vertex EX are more than twice those of 
the RealSSD.  I don't know where they got the 40k+ IOPS number for 
the RealSSD (I know it's in the specs, but how did they measure it?), but 
that's not what others are reporting:


http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7

Sadly, none of the current crop of SSDs support a capacitor or battery 
to back up their local (on-SSD) cache, so they're all subject to data 
loss on a power interruption.


Likewise, random read dominates L2ARC usage. Here, the most 
cost-effective solutions tend to be MLC-based SSDs with more moderate 
IOPS performance - the Intel X25-M and OCZ Vertex series likely offer much 
better price/performance than a RealSSD.
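
(Editor's sketch of how the cache devices would be attached - device names are 
placeholders.  ZFS fills and reads multiple cache devices round-robin:)

  # zpool add tank cache c5t0d0 c5t1d0 c5t2d0 c5t3d0
  # zpool iostat -v tank 5      # watch the "cache" section warm up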



Also, given the limitations of an x4 port connection to the rest of the 
system, I'd consider using a couple more SAS controllers and fewer 
expanders. The SSDs together are likely able to overwhelm an x4 
PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for 
them.  For the 42 disks, it depends more on what your workload looks 
like. If it is mostly small or random I/O to the disks, you can get away 
with fewer HBAs. Large, sequential I/O to the disks is going to require 
more HBAs.  Remember, a modern 7200RPM SATA drive can pump out well over 
100MB/s sequential, but well under 10MB/s random.  Do the math to see 
how fast that will overwhelm an x4 PCI-E 2.0 connection, which maxes out 
at about 2GB/s.
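
(Editorial back-of-the-envelope using the numbers above:)

  42 disks x ~100 MB/s sequential   ~= 4.2 GB/s aggregate off the spindles
  x4 PCIe 2.0 slot (4 x 500 MB/s)   ~= 2.0 GB/s
  x8 PCIe 2.0 slot (8 x 500 MB/s)   ~= 4.0 GB/s

(So roughly 20 disks streaming sequentially can already saturate one x4 slot.)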



I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping 
isn't really going to buy you much here (so far as I can tell).  6Gbit/s 
SAS is wasted on HDs, so don't bother paying for it if you can avoid 
doing so. Really, I suspect that paying for 6Gb/s SAS isn't worth it 
at all, as only the read performance of the L2ARC SSDs might 
possibly exceed 3Gb/s SAS.
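
(Editor's sketch of the mirrored-slog layout described above - device names are 
placeholders.  A mirrored log vdev protects the last few seconds of synchronous 
writes if a slog device dies at the same time as a crash or power loss:)

  # zpool add tank log mirror c4t0d0 c4t1d0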



I'm going to say something sacrilegious here:  128GB of RAM may be 
overkill.  You have the SSDs for L2ARC - much of which will be the DDT, 
but, if I'm reading this correctly, even if you switch to the 160GB 
Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only 
half is in use by the DDT. The rest is file cache.  You'll need lots of 
RAM if you plan on storing lots of small files in the L2ARC (that is, if 
your workload is lots of small files): about 200 bytes of RAM are needed 
for each L2ARC entry.


For example:

if you have a 1k average record size, then for 600GB of L2ARC you'll need 
600GB / 1kB * 200B = 120GB of RAM.

if you have a more manageable 8k record size, then 600GB / 8kB * 200B = 
15GB.
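
(Editorial aside: applied to the original poster's numbers - 16k blocks and 
roughly 1.28TB of L2ARC - the same formula gives:)

  1280GB / 16kB * 200B  ~=  80 million entries * 200B  ~=  16GB of RAM

(for the L2ARC headers alone.)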



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Darren J Moffat

On 15/06/2010 14:09, Erik Trimble wrote:

I'm going to say something sacrilegious here: 128GB of RAM may be
overkill. You have the SSDs for L2ARC - much of which will be the DDT,


The point of L2ARC is that you start adding L2ARC when you can no longer 
physically fit (or afford) any more DRAM, so if the OP can afford 
to put in 128GB of RAM then they should.


--
Darren J Moffat


Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Erik Trimble

On 6/15/2010 6:17 AM, Darren J Moffat wrote:

On 15/06/2010 14:09, Erik Trimble wrote:

I'm going to say something sacrilegious here: 128GB of RAM may be
overkill. You have the SSDs for L2ARC - much of which will be the DDT,


The point of L2ARC is that you start adding L2ARC when you can no 
longer physically put in (or afford) to add any more DRAM, so if OP 
can afford to put in 128GB of RAM then they should.




True.

I was speaking of price/performance.  Those 8GB DIMMs are still pretty 
darned pricey...


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Roy Sigurd Karlsbakk
 I'm going to say something sacrilegious here: 128GB of RAM may be
 overkill. You have the SSDs for L2ARC - much of which will be the DDT,
 but, if I'm reading this correctly, even if you switch to the 160GB
 Intel X25-M, that give you 8 x 160GB = 1280GB of L2ARC, of which only
 half is in-use by the DDT. The rest is file cache. You'll need lots of
 RAM if you plan on storing lots of small files in the L2ARC (that is,
 if your workload is lots of small files). 200bytes/record needed in
 RAM for an L2ARC entry.
 
 I.e.
 
 if you have 1k average record size, for 600GB of L2ARC, you'll need
 600GB / 1kb * 200B = 120GB RAM.
 
 if you have a more manageable 8k record size, then, 600GB / 8kB * 200B
 = 15GB

Now I'm confused. The first thing I heard was that about 160 bytes were needed per DDT 
entry. Later, someone else told me 270. Now you say 200. Also, there should 
be a good way to list the total number of blocks (zdb just crashed after filling 
memory on my 10TB test box). I tried browsing the source to see the size of the 
DDT struct, but I got lost. Can someone with an OpenSolaris development environment 
please just check sizeof() on that struct?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is 
an elementary imperative for all pedagogues to avoid excessive use of idioms of 
foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.


Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Erik Trimble

On 6/15/2010 6:40 AM, Roy Sigurd Karlsbakk wrote:

I'm going to say something sacrilegious here: 128GB of RAM may be
overkill. You have the SSDs for L2ARC - much of which will be the DDT,
but, if I'm reading this correctly, even if you switch to the 160GB
Intel X25-M, that give you 8 x 160GB = 1280GB of L2ARC, of which only
half is in-use by the DDT. The rest is file cache. You'll need lots of
RAM if you plan on storing lots of small files in the L2ARC (that is,
if your workload is lots of small files). 200bytes/record needed in
RAM for an L2ARC entry.

I.e.

if you have 1k average record size, for 600GB of L2ARC, you'll need
600GB / 1kb * 200B = 120GB RAM.

if you have a more manageable 8k record size, then, 600GB / 8kB * 200B
= 15GB
 

Now I'm confused. First thing I heard, was about 160 bytes was needed per DDT 
entry. Later, someone else told med 270. Then you, at 200. Also, there should 
be a good way to list out a total of blocks (zdb just crashed with a full 
memory on my 10TB test box). I tried browsing the source to see the size of the 
ddt struct, but I got lost. Can someone with an osol development environment 
please just check sizeof that struct?

Vennlige hilsener / Best regards

roy
--
   


A DDT entry takes up about 250 bytes, regardless of where it is stored.

For every normal (i.e. block, metadata, etc - NOT DDT ) L2ARC entry, 
about 200 bytes has to be stored in main memory (ARC).
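
(Editorial aside: at ~250 bytes per entry, the original poster's 640GB DDT 
estimate is about right for a full pool of unique 16k blocks:)

  40TB / 16kB blocks      ~= 2.5 billion DDT entries
  2.5e9 entries * 250B    ~= 625GB of DDT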



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Richard Elling
On Jun 15, 2010, at 6:40 AM, Roy Sigurd Karlsbakk wrote:

 I'm going to say something sacrilegious here: 128GB of RAM may be
 overkill. You have the SSDs for L2ARC - much of which will be the DDT,
 but, if I'm reading this correctly, even if you switch to the 160GB
 Intel X25-M, that give you 8 x 160GB = 1280GB of L2ARC, of which only
 half is in-use by the DDT. The rest is file cache. You'll need lots of
 RAM if you plan on storing lots of small files in the L2ARC (that is,
 if your workload is lots of small files). 200bytes/record needed in
 RAM for an L2ARC entry.
 
 I.e.
 
 if you have 1k average record size, for 600GB of L2ARC, you'll need
 600GB / 1kb * 200B = 120GB RAM.
 
 if you have a more manageable 8k record size, then, 600GB / 8kB * 200B
 = 15GB
 
 Now I'm confused. First thing I heard, was about 160 bytes was needed per DDT 
 entry. Later, someone else told med 270. Then you, at 200. Also, there should 
 be a good way to list out a total of blocks (zdb just crashed with a full 
 memory on my 10TB test box). I tried browsing the source to see the size of 
 the ddt struct, but I got lost. Can someone with an osol development 
 environment please just check sizeof that struct?

Why read source when you can read the output of zdb -D? :-)
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
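
(Editor's note - a minimal sketch of the zdb options being referred to, assuming 
a pool named "tank": -D prints entry counts and in-core/on-disk sizes for the 
existing DDT, and -DD adds a reference-count histogram:)

  # zdb -D tank
  # zdb -DD tank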






Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Richard Elling
On Jun 15, 2010, at 4:42 AM, Arve Paalsrud wrote:
 Hi,
 
 We are currently building a storage box based on OpenSolaris/Nexenta using 
 ZFS.
 Our hardware specifications are as follows:
 
 Quad AMD G34 12-core 2.3 GHz (~110 GHz)
 10 Crucial RealSSD (6Gb/s) 
 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
 LSI2008SAS (two 4x ports)
 Mellanox InfiniBand 40 Gbit NICs
 128 GB RAM
 
 This setup gives us about 40TB storage after mirror (two disks in spare), 
 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box.
 
 Both L2ARC and Zil shares the same disks (striped) due to bandwidth 
 requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k 
 read/write scenario with 70/30 distribution. Now, I know that you should have 
 mirrored Zil for safety, but the entire box are synchronized with an active 
 standby on a different site location (18km distance - round trip of 0.16ms + 
 equipment latency). So in case the Zil in Site A takes a fall, or the 
 motherboard/disk group/motherboard dies - we still have safety.
 
 DDT requirements for dedupe on 16k blocks should be about 640GB when main 
 pool are full (capacity).
 
 Without going into details about chipsets and such, do any of you on this 
 list have any experience with a similar setup and can share with us your 
 thoughts, do's and dont's, and any other information that could be of help 
 while building and configuring this?
 
 What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also 
 InfiniBand-based), with both dedupe and compression enabled in ZFS.

In general, both dedup and compression gain space by trading off performance.
You should take a closer look at snapshots + clones because they gain
performance by trading off systems management.
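
(Editor's sketch of the snapshot + clone pattern, with dataset names purely 
illustrative.  A clone only stores blocks that later diverge from the snapshot, 
which captures much of the space saving of dedup without the DDT overhead:)

  # zfs snapshot tank/esx/golden@2010-06-15
  # zfs clone tank/esx/golden@2010-06-15 tank/esx/vm042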

You can't size by ESX server, because ESX works (mostly) as a pass-through
of the client VM workload. In your sizing calculations, think of ESX as a fancy
network switch.
 -- richard

-- 
Richard Elling
rich...@nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/






Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Garrett D'Amore
On Tue, 2010-06-15 at 04:42 -0700, Arve Paalsrud wrote:
 Hi,
 
 We are currently building a storage box based on OpenSolaris/Nexenta using 
 ZFS.
 Our hardware specifications are as follows:
 
 Quad AMD G34 12-core 2.3 GHz (~110 GHz)
 10 Crucial RealSSD (6Gb/s) 
 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
 LSI2008SAS (two 4x ports)
 Mellanox InfiniBand 40 Gbit NICs

Just recognize that those NICs are IB only.  Solaris currently does
not support 10GbE using Mellanox products, even though other operating
systems do.  (There are folks working on resolving this, but I think
we're still a couple months from seeing the results of that effort.)

 128 GB RAM
 
 This setup gives us about 40TB storage after mirror (two disks in spare), 
 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box.
 
 Both L2ARC and Zil shares the same disks (striped) due to bandwidth 
 requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k 
 read/write scenario with 70/30 distribution. Now, I know that you should have 
 mirrored Zil for safety, but the entire box are synchronized with an active 
 standby on a different site location (18km distance - round trip of 0.16ms + 
 equipment latency). So in case the Zil in Site A takes a fall, or the 
 motherboard/disk group/motherboard dies - we still have safety.

I expect that you need more space for L2ARC and a lot less for ZIL.
Furthermore, you'd be better served by an even lower-latency/higher-IOPS
ZIL.  If you're going to spend this kind of cash, I think I'd recommend
at least one or two DDRdrive X1 units or something similar.  While not
very big, you don't need much to get a huge benefit from the ZIL, and I
think the vastly superior IOPS of these units will pay off in the end.

 
 DDT requirements for dedupe on 16k blocks should be about 640GB when main 
 pool are full (capacity).

Dedup is not always a win, I think.  I'd look hard at your data and
usage to determine whether to use it.
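
(Editor's note: zdb can estimate the payoff before dedup is enabled, and the 
achieved ratio can be checked afterwards - pool name assumed; the simulation can 
take a long time and a lot of memory on a large pool:)

  # zdb -S tank            # simulate dedup and print the projected ratio
  # zpool get dedupratio tank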

-- Garrett



Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Garrett D'Amore
On Tue, 2010-06-15 at 07:36 -0700, Richard Elling wrote:
  What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters 
  (also InfiniBand-based), with both dedupe and compression enabled in ZFS.
 
 In general, both dedup and compression gain space by trading off performance.
 You should take a closer look at snapshots + clones because they gain
 performance by trading off systems management.

It depends on the usage.  Note that for some uses, compression can be a
performance *win*, because generally CPUs are fast enough that the cost
of decompression beats the cost of the larger IOs required to transfer
uncompressed data.  Of course, that assumes you have CPU cycles to
spare.

-- Garrett



Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Arve Paalsrud
On Tue, Jun 15, 2010 at 3:09 PM, Erik Trimble erik.trim...@oracle.comwrote:

 On 6/15/2010 4:42 AM, Arve Paalsrud wrote:

 Hi,

 We are currently building a storage box based on OpenSolaris/Nexenta using
 ZFS.
 Our hardware specifications are as follows:

 Quad AMD G34 12-core 2.3 GHz (~110 GHz)
 10 Crucial RealSSD (6Gb/s)
 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
 LSI2008SAS (two 4x ports)
 Mellanox InfiniBand 40 Gbit NICs
 128 GB RAM

 This setup gives us about 40TB storage after mirror (two disks in spare),
 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box.

 Both L2ARC and Zil shares the same disks (striped) due to bandwidth
 requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k
 read/write scenario with 70/30 distribution. Now, I know that you should
 have mirrored Zil for safety, but the entire box are synchronized with an
 active standby on a different site location (18km distance - round trip of
 0.16ms + equipment latency). So in case the Zil in Site A takes a fall, or
 the motherboard/disk group/motherboard dies - we still have safety.

 DDT requirements for dedupe on 16k blocks should be about 640GB when main
 pool are full (capacity).

 Without going into details about chipsets and such, do any of you on this
 list have any experience with a similar setup and can share with us your
 thoughts, do's and dont's, and any other information that could be of help
 while building and configuring this?

 What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters
 (also InfiniBand-based), with both dedupe and compression enabled in ZFS.

 Let's talk moon landings.

 Regards,
 Arve




 Given that for ZIL, random write IOPS is paramount, the RealSSD isn't a
 good choice.  SLC SSDs still spank any MLC device, and random IOPS for
 something like an Intel X25-E or OCZ Vertex EX are over twice that of the
 RealSSD.  I don't know where they manage to get 40k+ IOPS number for the
 RealSSD (I know it's in the specs, but how did they get that?), but that's
 not what others are reporting:


 http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7


See http://www.anandtech.com/show/2944/3 and
http://www.crucial.com/pdf/Datasheets-letter_C300_RealSSD_v2-5-10_online.pdf
But I agree that we should look into using the Vertex instead.

Sadly, none of the current crop of SSDs support a capacitor or battery to
 back up their local (on-SSD) cache, so they're all subject to data loss on a
 power interruption.


Noted


 Likewise, random Read dominates L2ARC usage. Here, the most cost-effective
 solutions tend to be MLC-based SSDs with more moderate IOPS performance -
 the Intel X25-M and OCZ Vertex series are likely much more cost-effective
 than a RealSSD, especially considering price/performance.


Our other option is to use two Fusion-io ioDrive Duo SLC/MLC cards, or the SMLC
version when available (along with Solaris drivers) - so the price we're
currently talking about is not an issue.


 Also, given the limitations of a x4 port connection to the rest of the
 system, I'd consider using a couple more SAS controllers, and fewer
 Expanders. The SSDs together are likely to be able to overwhelm a x4 PCI-E
 connection, so I'd want at least one dedicated x4 SAS HBA just for them.
  For the 42 disks, it depends more on what your workload looks like. If it
 is mostly small or random I/O to the disks, you can get away with fewer
 HBAs. Large, sequential I/O to the disks is going to require more HBAs.
  Remember, a modern 7200RPM SATA drive can pump out well over 100MB/s
 sequential, but well under 10MB/s random.  Do the math to see how fast it
 will overwhelm the x4 PCI-E 2.0 connection which maxes out at about 2GB/s.


We're talking about x4 SAS 6Gb/s lanes - 4800MB/s per port. See
http://www.lsi.com/DistributionSystem/AssetDocument/SCG_LSISAS2008_PB_043009.pdf for
the specifications of the LSI chip. In other words, it uses PCIe 2.0 x8.
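
(Editorial note - the usual line-rate arithmetic for these interfaces, ignoring 
protocol overhead:)

  6Gb/s SAS lane (8b/10b encoded)   ~= 600 MB/s usable per lane
  x4 wide SAS port                  ~= 2.4 GB/s per direction (~4.8 GB/s full duplex)
  PCIe 2.0 x8 host interface        ~= 4 GB/s per direction

(So with both x4 ports running flat out in one direction, the PCIe 2.0 x8 link, 
at roughly 4 GB/s, becomes the ceiling for a single SAS2008.)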

I'd go with 2 Intel X25-E 32GB models for ZIL. Mirror them - striping isn't
 really going to buy you much here (so far as I can tell).  6Gbit/s SAS is
 wasted on HDs, so don't bother paying for it if you can avoid doing so.
 Really, I'd suspect that paying for 6Gb/s SAS isn't worth it at all, as
 really only the read performance of the L2ARC SSDs might possibly exceed
 3Gb/s SAS.


What about bandwidth in this scenario? Won't the ZIL be limited to the
throughput of only one X25-E? The SATA disks operate at 3Gb/s through the
SAS expanders, so no 6Gb/s there.


 I'm going to say something sacrilegious here:  128GB of RAM may be
 overkill.  You have the SSDs for L2ARC - much of which will be the DDT, but,
 if I'm reading this correctly, even if you switch to the 160GB Intel X25-M,
 that give you 8 x 160GB = 1280GB of L2ARC, of which only half is in-use by
 the DDT. The rest is file cache.  You'll need lots of RAM if you plan on
 storing lots of small files in the L2ARC (that is, if your workload is lots
 of small 

Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Arve Paalsrud
On Tue, Jun 15, 2010 at 4:20 PM, Erik Trimble erik.trim...@oracle.comwrote:

  On 6/15/2010 6:57 AM, Arve Paalsrud wrote:

 On Tue, Jun 15, 2010 at 3:09 PM, Erik Trimble erik.trim...@oracle.comwrote:

 I'd go with 2 Intel X25-E 32GB models for ZIL. Mirror them - striping isn't
 really going to buy you much here (so far as I can tell).  6Gbit/s SAS is
 wasted on HDs, so don't bother paying for it if you can avoid doing so.
 Really, I'd suspect that paying for 6Gb/s SAS isn't worth it at all, as
 really only the read performance of the L2ARC SSDs might possibly exceed
 3Gb/s SAS.


  What about bandwidth in this scenario? Won't the ZIL be limited to the
 throughput of only one X25-E? The SATA disks operates at 3Gb/s through the
 SAS expanders, so no 6Gb/s there.

 Yes - though I'm not sure how the slog devices work when there is more than
 one. I *don't* think they work like the L2ARC devices, which work
 round-robin. You'd have to ask.  If they're doing a true stripe, then I
 doubt you'll get much more performance as weird as that sounds.  Also, even
 with a single X25-E, you can service a huge number of IOPS - likely more
 small IOPS than can be pushed over even an Infiniband interface.  The place
 that the Infiniband would certainly outpace the X25-E's capacity is for
 large writes, where a single 100MB write would suck up all the X25-E's
 throughput capability.


But the Intel X25-E is limited to about 200 MB/s of writes, regardless of IOPS.
So when throwing a lot of 16k IOPS (about 13,000) at it, it will still be
limited to 200 MB/s - or about 6-7% of the capacity of a QDR InfiniBand
link.

So I hereby officially ask: can I have multiple slogs striped to handle
higher bandwidth than a single device can - is that supported in ZFS?

  --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA

  - Arve


Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Arve Paalsrud
On Tue, Jun 15, 2010 at 3:33 PM, Erik Trimble erik.trim...@oracle.comwrote:

 On 6/15/2010 6:17 AM, Darren J Moffat wrote:

 On 15/06/2010 14:09, Erik Trimble wrote:

 I'm going to say something sacrilegious here: 128GB of RAM may be
 overkill. You have the SSDs for L2ARC - much of which will be the DDT,


 The point of L2ARC is that you start adding L2ARC when you can no longer
 physically put in (or afford) to add any more DRAM, so if OP can afford to
 put in 128GB of RAM then they should.


 True.

 I was speaking price/performance.  Those 8GB DIMMs are still pretty darned
 pricey...


 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA


The motherboard has 32 DIMM slots - so using 32 4GB modules to get to
128GB is quite affordable :)

-Arve


Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Arve Paalsrud


 -Original Message-
 From: Garrett D'Amore [mailto:garr...@nexenta.com]
 Sent: 15. juni 2010 17:43
 To: Arve Paalsrud
 Cc: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
 
 On Tue, 2010-06-15 at 04:42 -0700, Arve Paalsrud wrote:
  Hi,
 
  We are currently building a storage box based on OpenSolaris/Nexenta
 using ZFS.
  Our hardware specifications are as follows:
 
  Quad AMD G34 12-core 2.3 GHz (~110 GHz)
  10 Crucial RealSSD (6Gb/s)
  42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
  LSI2008SAS (two 4x ports)
  Mellanox InfiniBand 40 Gbit NICs
 
 Just recognize that those NICs are IB only.  Solaris currently does
 not support 10GbE using Mellanox products, even though other operating
 systems do.  (There are folks working on resolving this, but I think
 we're still a couple months from seeing the results of that effort.)
 
  128 GB RAM
 
  This setup gives us about 40TB storage after mirror (two disks in
 spare), 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box.
 
  Both L2ARC and Zil shares the same disks (striped) due to bandwidth
 requirements. Each SSD has a theoretical performance of 40-50k IOPS on
 4k read/write scenario with 70/30 distribution. Now, I know that you
 should have mirrored Zil for safety, but the entire box are
 synchronized with an active standby on a different site location (18km
 distance - round trip of 0.16ms + equipment latency). So in case the
 Zil in Site A takes a fall, or the motherboard/disk group/motherboard
 dies - we still have safety.
 
 I expect that you need more space for L2ARC and a lot less for Zil.
 Furthmore, you'd be better served by an even lower latency/higher IOPs
 ZIL.  If you're going to spend this kind of cash, I think I'd recommend
 at least one or two DDR Drive X1 units or something similar.  While not
 very big, you don't need much to get a huge benefit from the ZIL, and I
 think the vastly superior IOPS of these units will pay off in the end.


What about the ZIL bandwidth in this case? I mean, could I stripe across 
multiple devices to be able to handle higher throughput? Otherwise I would 
still be limited to the performance of the unit itself (155 MB/s).
 
 
  DDT requirements for dedupe on 16k blocks should be about 640GB when
 main pool are full (capacity).
 
 Dedup is not always a win, I think.  I'd look hard at your data and
 usage to determine whether to use it.
 
   -- Garrett

-Arve



Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Garrett D'Amore
On Tue, 2010-06-15 at 18:33 +0200, Arve Paalsrud wrote:

 
 What about the ZIL bandwidth in this case? I mean, could I stripe across 
 multiple devices to be able to handle higher throughput? Otherwise I would 
 still be limited to the performance of the unit itself (155 MB/s).
  

I think so.  Btw, I've gotten better performance than that with my
driver (not sure about the production driver).  I seem to recall about
220 MB/sec.  (I was basically driving the PCIe x1 bus to its limit.)
This was with large transfers (sized at 64k IIRC.)  Shrinking the job
size down, I could get up to 150K IOPS with 512 byte jobs.  (This high
IOP rate is unrealistic for ZFS -- for ZFS the bus bandwidth limitation
comes into play long before you start hitting IOPS limitations.)

One issue of course is that each of these units occupies a PCIe x1 slot.

On another note, if your dataset and usage requirements don't demand
strict I/O flush/sync guarantees, you could probably get away without
any ZIL at all, and just use lots of RAM to get really good performance.
(You'd then disable the ZIL on the filesystems that don't have this need.
This is a very new feature in OpenSolaris.)  Of course, you don't want
to do this for data sets where loss of the data would be tragic.  (But
it's ideal for situations such as filesystems used for compiling, etc. --
where the data being written can easily be regenerated in the event of a
failure.)
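
(Editor's note: the per-filesystem control referred to here is presumably the 
new sync dataset property in recent OpenSolaris builds - a sketch, dataset name 
illustrative:)

  # zfs set sync=disabled tank/build    # async only: window for data loss on crash
  # zfs set sync=standard tank/build    # back to normal POSIX sync semantics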

-- Garrett


  
   DDT requirements for dedupe on 16k blocks should be about 640GB when
  main pool are full (capacity).
  
  Dedup is not always a win, I think.  I'd look hard at your data and
  usage to determine whether to use it.
  
  -- Garrett
 
 -Arve
 
 




Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)

2010-06-15 Thread Przemyslaw Ceglowski
On 15/06/2010 12:42, Arve Paalsrud arve.paals...@gmail.com wrote:

 Hi,
 
 We are currently building a storage box based on OpenSolaris/Nexenta using
 ZFS.
 Our hardware specifications are as follows:
 
 Quad AMD G34 12-core 2.3 GHz (~110 GHz)
 10 Crucial RealSSD (6Gb/s)
 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
 LSI2008SAS (two 4x ports)
 Mellanox InfiniBand 40 Gbit NICs

I was told that IB support in Nexenta is scheduled to be released in 3.0.4
(beginning of July).

 128 GB RAM
 
 This setup gives us about 40TB storage after mirror (two disks in spare),
 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box.
 
 Both L2ARC and Zil shares the same disks (striped) due to bandwidth
 requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k
 read/write scenario with 70/30 distribution. Now, I know that you should have
 mirrored Zil for safety, but the entire box are synchronized with an active
 standby on a different site location (18km distance - round trip of 0.16ms +
 equipment latency). So in case the Zil in Site A takes a fall, or the
 motherboard/disk group/motherboard dies - we still have safety.
 
 DDT requirements for dedupe on 16k blocks should be about 640GB when main pool
 are full (capacity).
 
 Without going into details about chipsets and such, do any of you on this list
 have any experience with a similar setup and can share with us your thoughts,
 do's and dont's, and any other information that could be of help while
 building and configuring this?
 
 What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also
 InfiniBand-based), with both dedupe and compression enabled in ZFS.

As VMware does not currently support NFS over RDMA, you will need to stick
with IPoIB, which will suffer from some of the performance implications inherent
in the traditional TCP/IP stack. You could also use iSER or SRP, which are both
supported.
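
(Editor's sketch: whichever transport is chosen, the NFS export itself is plain 
ZFS sharing - dataset name and client network are illustrative, and ESX needs 
root access to the export:)

  # zfs set sharenfs=rw=@192.168.10.0/24,root=@192.168.10.0/24 tank/esx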

 
 Let's talk moon landings.
 
 Regards,
 Arve

-- 
Przem
