Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-29 Thread Edward Ned Harvey
 From: Deano [mailto:de...@rattie.demon.co.uk]
 
 Hi Edward,
 Do you have a source for the 8KiB block size data? Whilst we can't avoid
 the SSD controller, in theory we can change the smallest size we present
 to the SSD to 8KiB fairly easily... I wonder if that would help the
 controller do a better job (especially with TRIM).
 
 I might have to do some tests; so far the assumption (even inside Sun's sd
 driver) is that SSDs are really 4KiB even when they claim 512B. Perhaps we
 should have an 8KiB option...

It's hard to say precisely where the truth lies, so I'll just tell a story;
take from it what you will.

For me, it started when I began deploying new laptops with SSDs.  There was
a problem with the backup software, so I kept reimaging machines with dd,
then backing up and restoring with Acronis, and when that failed, restoring
again via dd, etc.  So I kept overwriting the drive repeatedly.  After only
2-3 iterations, performance degraded to around 50% of the original speed.

At work, we have a team of engineers who know flash intimately, so I asked
them about flash performance degrading with use.  The first response was
that each time a cell is erased and rewritten, the data isn't written as
cleanly as before.  Like erasing pencil or a chalkboard and rewriting over
and over, it becomes smudgy.  So with repetition and age, the device becomes
slower and consumes more power, because there's a higher incidence of
errors, more error correction, and more retries with varying operating
parameters on the chips.  All of this is invisible to the OS but affects
performance internally.  But when I said I was seeing a 50% loss after only
2-3 iterations, it became clear that this wear-out degradation wasn't the
issue; it only becomes significant after tens of thousands of iterations or
more.

They suggested the problem must be caused by something in the controller,
not in the flash itself.

So I kept working on it.  I found this:
http://www.pcper.com/article.php?aid=669&type=expert (see the section on
Write Combining)
Rather than reading that whole article, the most valuable thing to take away
from it is a set of useful search terms:

ssd write combining
ssd internal fragmentation
ssd sector remapping

This is very similar to ZFS write aggregation.  They're combining small
writes into larger blocks and taking advantage of block remapping to keep
track of it all.  You gain performance during lots of small writes, and it
doesn't hurt you for lots of small random reads.  But it does hurt you for
sequential reads/writes that happen after the remapping.  Also, unlike ZFS,
the drive can't straighten itself out after the fact, when data gets
deleted, moved, or overwritten; the only recourse it has is TRIM.
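
If it helps to picture the mechanism, here is a toy sketch of write
combining with a remap table.  It is purely illustrative (not any vendor's
firmware); the 8KiB page and 4KiB logical block sizes are just the numbers
from this discussion.  Small incoming writes are buffered until a full page
accumulates, then written together and tracked in a logical-to-physical map,
which is exactly how logically adjacent data ends up physically scattered.

# Toy model of SSD-style write combining with a logical-to-physical remap
# table.  Illustration only; page/block sizes are assumptions, not specs.
PAGE = 8192      # assumed physical flash page size (bytes)
SECTOR = 4096    # logical block size the host writes (bytes)

class ToyFTL:
    def __init__(self):
        self.remap = {}      # logical block address -> (physical page, offset)
        self.pending = []    # logical blocks waiting to be combined
        self.next_page = 0   # next free physical page

    def write(self, lba):
        """Buffer a small logical write; flush once a full page is gathered."""
        self.pending.append(lba)
        if len(self.pending) * SECTOR >= PAGE:
            self._flush()

    def _flush(self):
        page, self.next_page = self.next_page, self.next_page + 1
        for slot, lba in enumerate(self.pending):
            self.remap[lba] = (page, slot * SECTOR)
        self.pending = []

ftl = ToyFTL()
for lba in [10, 500, 11, 501]:   # interleaved "random" small writes
    ftl.write(lba)
print(ftl.remap)   # {10: (0, 0), 500: (0, 4096), 11: (1, 0), 501: (1, 4096)}
# Logically adjacent blocks 10 and 11 land on different physical pages, each
# sharing a page with unrelated data, so a later sequential read of 10..11
# walks a fragmented physical layout.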

After discovering this, I went back to the flash guys at work and explained
the internal fragmentation idea.  One of the head engineers was there at the
time, and he's the one who told me flash is made in 8k pages.  "To flash
manufacturers, SSDs are the pimple on the butt of the elephant" was his
statement.  Unfortunately, hard disks and OSes historically both used 512b
sectors.  Then hard drives started using 4k sectors, but to maintain
compatibility with OSes they still emulate 512b on the interface.  The OS
assumes the disk is doing this, so it aligns writes to 4k boundaries in
order to avoid the read/modify/write.  Unfortunately, SSDs are now using an
8k physical page size and emulating who knows what (4k or 512b) on the
interface, so the RMW is once again necessary until OSes become aware and
start aligning on 8k pages instead...  But even that doesn't matter much,
thanks to sector remapping and write combining: even if your OS is
intelligent enough, you're still going to end up with fragmentation anyway,
unless the OS pads every write out to a full 8k page.
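
To make the alignment arithmetic concrete, here is a small sketch that
counts how many flash pages a given write only partially covers, i.e. the
candidates for read/modify/write.  The 8k page size is the assumption from
above, and real drives remap internally, so treat this as a model rather
than what any particular controller does.

# Count flash pages only partially covered by a host write; each partially
# covered page is a read-modify-write candidate.  Page size is an assumption.
PAGE = 8192

def rmw_pages(offset, length, page=PAGE):
    """Number of pages the write at byte `offset` covers only partially."""
    end = offset + length
    partial = set()
    if offset % page:           # write starts in the middle of a page
        partial.add(offset // page)
    if end % page:              # write ends in the middle of a page
        partial.add((end - 1) // page)
    return len(partial)

for off, length in [(0, 8192), (0, 4096), (4096, 4096), (1536, 4096)]:
    print(f"offset={off:5d} length={length}: {rmw_pages(off, length)} partial page(s)")
# A page-aligned, page-sized write needs no merge; in this model a 4k write
# always leaves half a page untouched, so the controller has to merge.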

But getting back to the point: the question I think you're asking is how to
verify the existence of the 8k physical page inside the SSD.

There are two ways to prove it that I can think of:  (a) rip apart your SSD,
hope you can read the chip numbers, and hope you can find specs for those
chips that confirm or deny the 8k pages; or (b) TRIM your entire drive and
see if it returns to its original performance afterward.  The latter can be
done via HDDErase, but that requires temporarily switching into ATA mode,
booting from a DOS disk, and then putting it back into AHCI mode
afterward...  I went as far as putting it into ATA mode, but then I found
that creating the DOS disk was going to be a rathole for me, so I decided to
call it quits and assume I had the right answer with a high enough degree of
confidence.  Since performance is only degraded for sequential operations, I
will see degradation for OS rebuilds, but users probably won't notice.
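
For what it's worth, option (b) only needs a crude before/after comparison.
Here is a minimal sketch of a sequential-write timing loop; the file path
and sizes are arbitrary assumptions, and writing through a filesystem is
only a rough proxy for raw-device behaviour, but run before and after a
whole-drive TRIM / secure erase it should show whether the gap closes.

# Crude sequential-write throughput check, to be run before and after a
# whole-drive TRIM / secure erase and compared.  TEST_FILE is a hypothetical
# path; point it at a filesystem on the SSD under test.
import os, time

TEST_FILE = "/tmp/seqwrite.bin"
CHUNK = 1024 * 1024          # 1 MiB per write
TOTAL_MB = 512               # total amount to write

def seq_write_mb_per_s():
    buf = os.urandom(CHUNK)
    fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.time()
    for _ in range(TOTAL_MB):
        os.write(fd, buf)
    os.fsync(fd)             # make sure the data actually reaches the device
    os.close(fd)
    elapsed = time.time() - start
    os.unlink(TEST_FILE)
    return TOTAL_MB / elapsed

print(f"sequential write: ~{seq_write_mb_per_s():.0f} MB/s")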


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Eff Norwood
 
 We tried all combinations of OCZ SSDs including their PCI based SSDs and
 they do NOT work as a ZIL. After a very short time performance degrades
 horribly and for the OCZ drives they eventually fail completely. 

This was something interesting I found recently.  Apparently, to flash
manufacturers, flash hard drives are like the pimple on the butt of the
elephant.  The vast majority of the flash production in the world goes into
devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
into hard drives.  As a result, they optimize for those other devices, and
one of the important side effects is that standard flash chips use an 8K
page size, while hard drives use either 4K or 512B sectors.

The SSD controller secretly remaps blocks internally and aggregates small
writes into a single 8K write, so there's really no way for the OS to know
whether it's writing to a 4K block that happens to share an 8K page with
another 4K block.  So it's unavoidable, and whenever it happens, the drive
can't simply write.  It must read-modify-write, which is obviously much
slower.
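
As a rough illustration of why that hurts, here is the shape of the effect
with made-up round-number timings (not from any datasheet); real controllers
also juggle erases and garbage collection, which widens the gap further.

# Illustrative only: assumed round-number flash timings, to show why a
# partial-page write costs more than a plain page program.
T_READ_US = 50     # assumed time to read one flash page
T_PROG_US = 800    # assumed time to program one flash page

def host_write_us(covers_whole_page):
    if covers_whole_page:
        return T_PROG_US              # just program the new page
    return T_READ_US + T_PROG_US      # read the old page, merge, then program

print("aligned full-page write: ", host_write_us(True), "us")
print("partial-page write (RMW):", host_write_us(False), "us")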

Also, if you look up the specs of an SSD, for IOPS and/or sustainable
throughput...  They lie.  Well, technically they're not lying, because
technically it is *possible* to reach whatever they say: optimize your usage
patterns and only use blank drives which are new from the box or have been
fully TRIM'd.  Pffft...  But in my experience, reality is about 50% of
whatever they say.

Presently, the only way to deal with all this is the TRIM command, which
cannot eliminate the read/modify/writes but can reduce their occurrence.
Make sure your OS supports TRIM.  I'm not sure at what point ZFS added TRIM,
or to what extent...  I can't really measure the effectiveness myself.

Long story short, in the real world you can expect the DDRdrive to crush and
shame the performance of any SSD you can find.  It's mostly a question of a
PCIe slot versus a SAS/SATA port, and other characteristics you might care
about, like external power, etc.





Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Deano
Hi Edward,
Do you have a source for the 8KiB block size data? Whilst we can't avoid the
SSD controller, in theory we can change the smallest size we present to the
SSD to 8KiB fairly easily... I wonder if that would help the controller do a
better job (especially with TRIM).

I might have to do some tests; so far the assumption (even inside Sun's sd
driver) is that SSDs are really 4KiB even when they claim 512B. Perhaps we
should have an 8KiB option...

Thanks,
Deano
de...@cloudpixies.com



Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread taemun
Comments below.

On 29 January 2011 00:25, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 This was something interesting I found recently.  Apparently for flash
 manufacturers, flash hard drives are like the pimple on the butt of the
 elephant. A vast majority of the flash production in the world goes into
 devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
 into hard drives.

http://www.eetimes.com/electronics-news/4206361/SSDs--Still-not-a--solid-state--business
~6.1 percent for 2010, from that estimate (the first thing that Google
turned up). Not denying what you said; I just like real figures rather than
random hearsay.


 As a result, they optimize for these other devices, and
 one of the important side effects is that standard flash chips use an 8K
 page size.  But hard drives use either 4K or 512B.

http://www.anandtech.com/Show/Index/2738?cPage=19&all=False&sort=0&page=5
Terms: page means the smallest data size that can be read or programmed
(written). Block means the smallest data size that can be erased. SSDs
commonly have a page size of 4KiB and a block size of 512KiB. I'd take
Anandtech's word on it.

There is probably some variance across the market, but for the vast
majority, this is true. Wikipedia's
http://en.wikipedia.org/wiki/Flash_memory#NAND_memories says that common
page sizes are 512B, 2KiB, and 4KiB.
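
To put rough numbers on that distinction, using the 4KiB page / 512KiB block
sizes quoted above (real parts vary, so treat these as illustrative):

# Pages per erase block, with the sizes quoted above (assumptions, not specs).
page_kib, block_kib = 4, 512
pages_per_block = block_kib // page_kib
print(pages_per_block, "pages per erase block")          # 128

# Erase granularity is the whole block, so reclaiming a block that holds one
# stale 4 KiB page can mean relocating up to the other 127 still-live pages.
print((pages_per_block - 1) * page_kib, "KiB of live data to copy, worst case")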

The SSD controller secretly remaps blocks internally, and aggregates small
 writes into a single 8K write, so there's really no way for the OS to know
 if it's writing to a 4K block which happens to be shared with another 4K
 block in the 8K page.  So it's unavoidable, and whenever it happens, the
 drive can't simply write.  It must read modify write, which is obviously
 much slower.

This is true, but for 512B-to-4KiB aggregation, as the 8KiB page doesn't
exist. As for writing when everything is full and you need to do an
erase... well, this is where TRIM is helpful.

 Also, if you look up the specs of an SSD, for IOPS and/or sustainable
 throughput...  They lie.  Well, technically they're not lying, because
 technically it is *possible* to reach whatever they say: optimize your
 usage patterns and only use blank drives which are new from the box or
 have been fully TRIM'd.  Pffft...  But in my experience, reality is about
 50% of whatever they say.

 Presently, the only way to deal with all this is the TRIM command, which
 cannot eliminate the read/modify/writes but can reduce their occurrence.
 Make sure your OS supports TRIM.  I'm not sure at what point ZFS added
 TRIM, or to what extent...  I can't really measure the effectiveness
 myself.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6957655

 Long story short, in the real world, you can expect the DDRDrive to crush
 and shame the performance of any SSD you can find.  It's mostly a question
 of PCIe slot versus SAS/SATA slot, and other characteristics you might care
 about, like external power, etc.

Sure, DDR RAM will have a much quicker sync write time. This isn't really a
surprising result.


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Eric D. Mudama

On Fri, Jan 28 at  8:25, Edward Ned Harvey wrote:

This was something interesting I found recently.  Apparently for flash
manufacturers, flash hard drives are like the pimple on the butt of the
elephant. A vast majority of the flash production in the world goes into
devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
into hard drives.  As a result, they optimize for these other devices, and
one of the important side effects is that standard flash chips use an 8K
page size.  But hard drives use either 4K or 512B.

The SSD controller secretly remaps blocks internally, and aggregates small
writes into a single 8K write, so there's really no way for the OS to know
if it's writing to a 4K block which happens to be shared with another 4K
block in the 8K page.  So it's unavoidable, and whenever it happens, the
drive can't simply write.  It must read modify write, which is obviously
much slower.


The reality is way more complicated, and statements like the above may
or may not be true on a vendor-by-vendor basis.

As time passes, the underlying NAND geometries are designed for certain
sets of advantages and are continually subject to re-evaluation and
modification, and good SSD controllers on top of the NAND (or other
solid-state storage) will map those advantages effectively into our
problem domains as users.

Testing methodologies are improving over time as well, and eventually
it will be more clear which devices are suited to which tasks.

The suitability of a specific solution to a problem space will always be a
balance between cost, performance, reliability, and time to market.  No
single solution (RAM SAN, RAM SSD, NAND SSD, BBU controllers, rotating
HDD, etc.) wins in every single area, or else we wouldn't be having this
discussion.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-27 Thread Eff Norwood
We tried all combinations of OCZ SSDs, including their PCI-based SSDs, and
they do NOT work as a ZIL. After a very short time performance degrades
horribly, and the OCZ drives eventually fail completely. We also tried
Intel, which performed a little better and didn't flat-out fail over time,
but those still did not work out as a ZIL. We use the DDRdrive X1 now for
all of our ZIL applications and could not be happier. The cards are great,
support is great, and performance is incredible. We use them to provide NFS
storage to 50K VMware VDI users. As you stated, the DDRdrive is ideal. Go
with that and you'll be very happy you did!


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-27 Thread James
Chris & Eff,
Thanks for your expertise on this and other posts.  Greatly appreciated.  I've 
just been re-reading some of the great SSD-as-ZIL discussions.

Chris,
Cost:  Our case is a bit non-representative, as we have spare P410/512s that
came with ESXi hosts (USB boot), so I've budgeted them at £0.  I will be in
touch for a quote; I just want to get all my theory straight on the options
first.
Benchmarks:  Good point on graph direction, and I look forward to seeing any
further papers.
Latency:  Yes, the 9.9ms avg latency (pg 49) was what initially got me
thinking about adding the BBWC in front.  Thanks for reviewing that theory.
Good to know it's an option.

Eff,
Thanks for the Vertex review.  Very helpful. Do you use mirror'd DDRDrives 
(or have you so much confidence in them you risk single devices?).


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-27 Thread Eff Norwood
They have been incredibly reliable with zero downtime or issues. As a result, 
we use 2 in every system striped. For one application outside of VDI, we use a 
pair of them mirrored, but that is very unusual and driven by the customer and 
not us.


[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-26 Thread James
I’m wondering if any of the ZIL gurus could examine the following and point out 
anywhere my logic is going wrong.  

For small backend systems (e.g. 24x10k SAS RAID 10) I’m expecting an
absolute maximum backend write throughput of 10,000 sequential IOPS** and
more realistically 2000-5000.  With small (4kB) blocksizes*, 10k IOPS is
about 400MB over 10s, so we don’t need much ZIL space or throughput.  What
we do need is the ability to absorb the IOPS at low latency and keep
absorbing them at least as fast as the backend storage can commit them.
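
A quick sanity check of that space figure, using the post's own numbers
(10,000 4kB writes per second held for roughly 10 seconds):

# ZIL space needed for ~10 s of retention at the assumed write rate.
iops, io_kb, seconds = 10_000, 4, 10
print(iops * io_kb * seconds / 1024, "MB")   # ~390 MB, i.e. roughly the 400MB above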

ZIL OPTIONS:  Obviously a DDRdrive is the ideal (36k 4k random IOPS***), but
for the same budget I can get 2x Vertex 2 EX 50GB drives and put each behind
its own P410 512MB BBWC controller.  Assuming the SSDs can do 6300 4k random
IOPS*** and that the controller cache confirms those writes at the same
latency as the DDRdrive (both PCIe-attached RAM?), then we should have
DDRdrive-type latency up to 6300 sustained IOPS.  Also, in bursting traffic,
we should be able to absorb up to 512MB of data (about 3.5s of 36,000 4k
IOPS) at much higher IOPS / lower latency, as long as the average stays at
or below 6300 (i.e. the SSD can empty the cache before it fills).
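
Re-running the burst arithmetic from that paragraph (figures are the post's
assumptions: 512MB of BBWC, 4KiB writes, 36,000 IOPS arriving, 6,300 IOPS
draining to the SSD):

# Time for the BBWC to fill during a sustained burst, with and without the
# SSD draining it; all figures are the assumptions stated in the post.
cache = 512 * 1024**2               # bytes of write cache
io = 4 * 1024                       # bytes per write
burst_iops, drain_iops = 36_000, 6_300

print(cache / (burst_iops * io))                  # ~3.6 s to fill, ignoring the drain
print(cache / ((burst_iops - drain_iops) * io))   # ~4.4 s while draining at 6300 IOPS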

So what are the issues with using this approach for low-budget builds
looking for mirrored ZILs that don’t require 6300 sustained write IOPS (due
to backend disk limitations)?  Obviously there are a lot of assumptions
here, but I wanted to get my theory straight before I start ordering things
to test.

Thanks all.
James

* For NTFS 4kB clusters on VMware / NFS, I believe a 4kB ZFS recordsize will
provide the best performance (avoiding partial writes).  Thoughts welcome on
that too.
** Assumes each 10k SAS drive can do a maximum of 900 sequential writes,
striped across 12 mirrors and rounded down (900 based on the Tom's Hardware
HDD streaming write bench).  Also assumes ZFS can take completely random
writes and turn them into completely sequential write IOPS on the underlying
disks, and that no reads, 32k writes, etc. are hitting the disks at the same
time.  Realistically, 2000-5000 is probably a more likely maximum.
*** Figures from the excellent DDRdrive presentation.  NB: if the BBWC can
sequentialise writes to the SSD, it may get closer to 10,000 IOPS.
 I’m assuming that the P410 BBWC and the DDRdrive have a similar
IOPS/latency profile; the DDRdrive may do something fancy with striping
across RAM to improve IO?

Similar Posts:
http://opensolaris.org/jive/thread.jspa?messageID=460871 - except with
normal disks instead of SSDs behind the cache (so the cache would fill).
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg39729.html - same
again
 


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-26 Thread Christopher George
 ZIL OPTIONS: Obviously a DDRdrive is the ideal (36k 4k random 
 IOPS***) but for the same budget I can get 2x Vertex 2 EX 50GB 
 drives and put each behind its own P410 512MB BBWC controller.

The Vertex 2 EX goes for approximately $900 each online, while the 
P410/512 BBWC is listed at HP for $449 each.  Cost-wise, you should 
contact us for a quote, as we are price-competitive with even a single 
SSD/HBA combination.  Especially as one obtains 4GB instead of 512MB 
of ZIL accelerator capacity.

 Assuming the SSDs can do 6300 4k random IOPS*** and that the 
 controller cache confirms those writes in the same latency as the 

For 4KB random writes you need to look closely at slides 47/48 of the 
referenced presentation (http://www.ddrdrive.com/zil_accelerator).

The 6443 IOPS figure is obtained after testing for *only* 2 hours post 
unpackaging or secure erase.  The slope of both curves gives a hint, as 
the Vertex 2 EX does not level off and will continue to decrease.  I am 
working on a new presentation focusing on this very fact: random write 
IOPS performance over time (the life of the device).  Suffice it to say, 
6443 IOPS is *not* worst-case performance for random writes on the 
Vertex 2 EX.

 DDRdrive (both PCIe attached RAM?) then we should have 
 DDRdrive type latency up to 6300 sustained IOPS.

All tests used a QD (queue depth) of 32, which will hide the device 
latency of a single IO.  This is very meaningful, as real-life workloads 
can be bound by even a single outstanding IO.  Let's trace the latency to 
determine which has the advantage: for the SSD/HBA combination, an IO has 
to run the gauntlet through two controllers (HBA and SSD) and propagate 
over a SATA cable, while the DDRdrive X1 has a single unified controller 
and no extraneous SATA cable; see slides 15-17.
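
As a rough illustration of the QD point (my own arithmetic, using only the
IOPS figures quoted in this thread): by Little's law the number of
outstanding IOs equals IOPS times average latency, so a benchmark pinned at
QD 32 can report high IOPS while each individual IO still waits a long time.

# Little's law: outstanding IOs = IOPS x average latency.
def avg_latency_ms(queue_depth, iops):
    return queue_depth / iops * 1000.0

print(avg_latency_ms(32, 36_000))   # ~0.9 ms per IO at the X1's quoted 36k IOPS
print(avg_latency_ms(32, 6_443))    # ~5.0 ms per IO at the SSD's quoted 6443 IOPS
print(avg_latency_ms(1, 6_443))     # ~0.16 ms is what QD=1 would demand for that rate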

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com