> From: Deano [mailto:de...@rattie.demon.co.uk]
> 
> Hi Edward,
> Do you have a source for the 8KiB block size data? Whilst we can't avoid
> the SSD controller, in theory we can change the smallest size we present
> to the SSD to 8KiB fairly easily... I wonder if that would help the
> controller do a better job (especially with TRIM)
> 
> I might have to do some tests; so far the assumption (even inside Sun's sd
> driver) is that SSDs are really 4KiB even when they claim 512B, perhaps we
> should have an 8KiB option...

It's hard to say precisely where the truth lies, so I'll just tell a story
and take from it what you will.

For me, it started when I began deploying new laptops with SSDs.  There was
a problem with the backup software, so I kept reimaging machines using "dd",
then backing up and restoring with Acronis, and when that failed, restoring
again via dd, etc etc etc.  So I kept overwriting the drive repeatedly.
After only 2-3 iterations, performance degraded to around 50% of the
original speed.

At work, we have a team of engineers who know flash intimately, so I asked
them about flash performance degrading with usage.  Their first response was
that each time a cell is erased and rewritten, the data isn't written as
cleanly as before.  Like erasing pencil or a chalkboard and rewriting over
and over, it becomes "smudgy."  So with repetition and age, the device
becomes slower and consumes more power, because errors become more frequent
and the controller has to do more error correction and retry operations
with varying operating parameters on the chips.  All of this is invisible
to the OS but affects performance internally.  But when I said I was seeing
a 50% loss after only 2-3 iterations, it became clear this wear wasn't the
issue: that kind of degradation only becomes significant after tens of
thousands of iterations or more.

They suggested the problem must lie in the controller, not in the flash
itself.

So I kept working on it.  I found this:
http://www.pcper.com/article.php?aid=669&type=expert (see the section on
Write Combining)
Rather than reading that whole article... the most valuable thing to come
out of it is a set of useful search terms:

ssd "write combining"
ssd internal fragmentation
ssd sector remapping

This is very similar to ZFS write aggregation.  They're combining small
writes into larger blocks and using block remapping to keep track of it
all.  You gain performance during lots of small writes.  It doesn't hurt
you for lots of random small reads, but it does hurt you for sequential
reads/writes that happen after the remapping.  Also, unlike ZFS, the drive
can't fully recover after the fact when data gets deleted or moved or
overwritten; it has no way to straighten itself out, except TRIM.
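To make the write-combining idea concrete, here's a toy sketch.  It's
purely illustrative: the page/sector sizes and the TinyFTL class are my
assumptions, not any vendor's actual firmware.  It shows how combining
interleaved small writes scatters a logically sequential file across
physical pages:

```python
PAGE_SIZE = 8192     # assumed physical flash page size (8 KiB)
SECTOR_SIZE = 512    # logical sector size presented on the interface

class TinyFTL:
    """Toy flash translation layer: combines incoming logical sectors
    into full physical pages and remembers where each sector landed."""
    def __init__(self):
        self.next_page = 0    # next free physical page
        self.buffer = []      # sectors waiting to be combined
        self.mapping = {}     # logical sector -> (physical page, slot)

    def write_sector(self, lba):
        self.buffer.append(lba)
        if len(self.buffer) == PAGE_SIZE // SECTOR_SIZE:
            self.flush()

    def flush(self):
        # One physical page absorbs many unrelated logical sectors.
        for slot, lba in enumerate(self.buffer):
            self.mapping[lba] = (self.next_page, slot)
        self.next_page += 1
        self.buffer = []

ftl = TinyFTL()
# Interleave small writes from two "files"; the controller combines them...
for i in range(16):
    ftl.write_sector(1000 + i)   # file A
    ftl.write_sector(9000 + i)   # file B
# ...so a later sequential read of file A now spans multiple physical pages.
pages_for_A = {ftl.mapping[1000 + i][0] for i in range(16)}
print(len(pages_for_A))   # file A's 16 sectors are spread over more than one page
```

Written sequentially on their own, file A's 16 sectors would fit in a
single 8k page; interleaved, they land on two.  That's the fragmentation
the drive can't undo without TRIM.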

After discovering this, I went back to the flash guys at work and explained
the internal fragmentation idea.  One of the head engineers was there at
the time, and he's the one who told me flash is made in 8k pages.  "To
flash manufacturers, SSDs are the pimple on the butt of the elephant" was
his statement.  Unfortunately, hard disks and OSes both historically used
512b sectors.  Then hard drives started using 4k sectors, but to maintain
compatibility with OSes, they still emulate 512b on the interface.  The OS
knows the disk is doing this, so it aligns its 512b writes on 4k boundaries
to avoid the read/modify/write.  Unfortunately, SSDs are now using an 8k
physical page size and emulating god knows what (4k or 512b) on the
interface, so the RMW is once again necessary until OSes become aware and
start aligning on 8k pages instead...  But even that doesn't matter
anymore: thanks to sector remapping and write combining, even an
intelligent OS still ends up with fragmentation anyway, unless it pads
every write out to a full 8k page.
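The alignment arithmetic can be sketched in a few lines.  This is a hedged
illustration assuming the 8k physical page; the rmw_needed helper is mine,
not anything a real controller exposes:

```python
PAGE = 8192  # assumed physical page size in bytes

def rmw_needed(offset, length):
    """A write forces a read-modify-write if it only partially covers
    the first or last physical page it touches."""
    return offset % PAGE != 0 or (offset + length) % PAGE != 0

# A 4 KiB write aligned to 4 KiB, but not to the 8 KiB page:
print(rmw_needed(4096, 4096))   # True: straddles a partial page
# The same write aligned to 8 KiB still covers only half a page:
print(rmw_needed(8192, 4096))   # True: the tail of the page is untouched
# Only a full, page-aligned write avoids the RMW entirely:
print(rmw_needed(8192, 8192))   # False
```

Which is exactly why 4k-aligned writes, perfect for a 4k-native hard
drive, can still trigger RMW on a drive with 8k pages.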

But getting back to the point: the question I think you're asking is how
to verify the existence of the 8k physical page inside the SSD.

There are two ways to prove it that I can think of:  (a) rip apart your
SSD and hope you can read the chip numbers and find specs for those chips
to confirm or deny the 8k pages, or (b) TRIM your entire drive and see if
it returns to its original performance afterward.  The latter can be done
via HDDErase, but that requires temporarily switching into ATA mode,
booting from a DOS disk, and then switching back to AHCI mode
afterward...  I went as far as putting the drive into ATA mode, but then
I found that creating the DOS disk was going to be a rathole, so I decided
to call it quits and assume I had the right answer with a high enough
degree of confidence.  Since performance is only degraded for sequential
operations, I will see degradation for OS rebuilds, but users probably
won't notice.
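For what it's worth, the before/after comparison in option (b) boils down
to timing a sequential read.  A rough sketch of the shape of that
measurement (a real SSD test would hit the raw device with direct I/O to
bypass the page cache; this just times a plain file read):

```python
import os
import tempfile
import time

def seq_read_mbps(path, chunk=1 << 20):
    """Time a sequential read of the whole file; return MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    elapsed = time.perf_counter() - start
    return (size / (1 << 20)) / elapsed if elapsed > 0 else float("inf")

# Demo against a temporary 16 MiB file (stand-in for the real device):
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(os.urandom(1 << 20) * 16)
    path = tf.name
print(f"{seq_read_mbps(path):.1f} MB/s")
os.unlink(path)
```

Run it once on the fragmented drive, TRIM, run it again; a return to the
original number would confirm the theory without ripping the drive apart.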

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
