Eric D. Mudama wrote:
On Fri, Jan 1 at 21:21, Erik Trimble wrote:
That all said, it certainly would be really nice to get a SSD
controller which can really push the bandwidth, and the only way I
see this happening now is to go the "stupid" route, and dumb down the
controller as much as possible. I really think we just want the
controller to Do What I Say, and not try any optimizations or such.
There's simply much more benefit to doing the optimization up at the
filesystem level than down at the device level. For a trivial case,
consider the dreaded read-modify-write problem of MLCs: to write a
single bit, a whole page has to be read, then the page recomposed
with the changed bits, before being written back. If the filesystem
were aware that the drive had this kind of issue, then in-RAM caching
would almost always allow the first "read" cycle to be avoided, and
performance goes back to that of a typical Copy-on-Write style
stripe write.
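A toy model may make the penalty described above concrete. This is only a sketch under assumptions: the 4 KB page size, the ToySSD class, and both helper functions are illustrative, not real ZFS or firmware code. A controller-side read-modify-write costs two device operations, while a filesystem that already holds the page in RAM pays only the final write:

```python
PAGE_SIZE = 4096  # bytes; a typical NAND page size (assumption)

class ToySSD:
    """Toy device: pages can only be written whole; counts device I/O ops."""
    def __init__(self, npages):
        self.pages = [bytes(PAGE_SIZE)] * npages
        self.io_ops = 0

    def read_page(self, n):
        self.io_ops += 1
        return self.pages[n]

    def write_page(self, n, data):
        assert len(data) == PAGE_SIZE
        self.io_ops += 1
        self.pages[n] = data

def set_byte_via_controller(ssd, page_no, off, val):
    # Controller-side R-M-W: the device must read the whole page
    # before it can rewrite it with the changed byte.
    buf = bytearray(ssd.read_page(page_no))   # the extra "read" cycle
    buf[off] = val
    ssd.write_page(page_no, bytes(buf))

def set_byte_via_fs_cache(ssd, cache, page_no, off, val):
    # Filesystem-side: the page is already in the in-RAM cache, so
    # only one full-page write ever touches the device.
    buf = bytearray(cache[page_no])
    buf[off] = val
    cache[page_no] = bytes(buf)
    ssd.write_page(page_no, cache[page_no])

ssd = ToySSD(8)
set_byte_via_controller(ssd, 0, 10, 0xFF)
print(ssd.io_ops)   # 2 device ops: read + write

ssd2 = ToySSD(8)
cache = {0: ssd2.pages[0]}          # page already cached in RAM
set_byte_via_fs_cache(ssd2, cache, 0, 10, 0xFF)
print(ssd2.io_ops)  # 1 device op: the write alone
```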
I am not convinced that a general-purpose CPU, running other software
in parallel, can be timely and responsive enough to maximize
bandwidth in an SSD without specialized hardware support. That
hardware support is, of course, the controller that exists on modern
SSDs.
Why not? My argument is the one that ZFS as a whole is founded on:
that modern CPUs have so many spare cycles that it's silly to pay extra
for a smart raid controller when we can just borrow time on the main
CPU. It seems to work out just fine for hard drives, so why not for
SSDs (which, while much faster than HDs, are still many orders of
magnitude slower than DMA transfers)?
Drive vendors abstracted these interfaces a long time ago, creating
Integrated Drive Electronics (IDE). Bringing all of that logic back
up into the CPU would likely not help meaningfully. Yes, it would
likely be cheaper, but I doubt it would be faster or more reliable.
I'm not advocating a return to something like the old IPI technology (oh
boy, did I just date myself there...). That's silly. By "dumb", I'm
referring to things on the level of IDE - a disk controller that
handles nothing more than internal (to the disk) bad block remapping,
LBA-to-physical-block mapping, etc. In the case of a "stupid" SSD
controller, that would entail sufficient smarts to do wear leveling,
LBA mapping, bad page/block detection and marking, and very little else.
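For illustration only, the bookkeeping such a "stupid" controller would need can be sketched in a few lines. The DumbController class, its least-worn-block policy, and every name here are my assumptions for the sketch, not any vendor's design:

```python
class DumbController:
    """Minimal sketch: LBA mapping, naive wear leveling, bad-block marking."""
    def __init__(self, nblocks):
        self.lba_map = {}                  # LBA -> physical block
        self.erase_counts = [0] * nblocks  # per-block wear counters
        self.bad = set()                   # blocks marked unusable
        self.free = set(range(nblocks))    # blocks available for writes
        self.store = {}                    # physical block -> contents

    def mark_bad(self, phys):
        # Bad block marking: never hand this block out again.
        self.bad.add(phys)
        self.free.discard(phys)

    def write(self, lba, data):
        # Wear leveling: place the write on the least-worn good block,
        # then retire the LBA's previous block back to the free pool.
        phys = min(self.free - self.bad, key=lambda b: self.erase_counts[b])
        self.free.discard(phys)
        self.store[phys] = data
        old = self.lba_map.get(lba)
        if old is not None:
            self.erase_counts[old] += 1    # the old block gets erased
            self.free.add(old)
        self.lba_map[lba] = phys
        return phys
```

Rewriting the same LBA twice lands on two different physical blocks; that, plus the bad-block set, is about all the intelligence being argued for here.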
I also am not convinced that your described RMW semantics are used in
any modern NAND devices. Those problems were solved years ago. The
granularity of the implementation has implications for performance in
some workloads, but I believe only those old JMicron-based SSDs did
block-level RMW, and hence wound up doing roughly 2-3 IOPS in random
workloads with MLC drives.
ALL modern MLC-based SSDs have exactly the problem I've described as the
example above. It's a defining characteristic of the Multi-Level Cell
design. A nice modern Intel X25-M can see a loss of 50-80% of its
theoretical maximum write performance once it runs out of unused cells
to write to. And that's with the fancy firmware.
SSDs (with good controllers) really strut their stuff when in-RAM
caching isn't working anyway. If in-RAM was good enough, then why
bother with SSD? Just have a spun-down rotating drive at 1/5th-1/15th
the cost.
--eric
No, they don't (strut, that is). MLC-based SSDs (and, even SLC-based
ones, to a lesser extent) have a very significant write penalty. Much
of the "smarts" that goes into current-gen SSDs is an attempt to
overcome this design limitation. What Bob and I are saying is that
locating the "smarts" in the SSD controller is misguided. Having this
intelligence located in the OS/Filesystem driver is a far better idea,
as the system has a much more global understanding of where
optimizations can occur, and can make the appropriate choices. And,
frankly, it's far easier to update a filesystem driver than it is to
reflash firmware on an SSD, should any changes be necessary.
The example I was giving for R-M-W is that it is /highly/ likely that
the OS already has a significant chunk of the file to be modified
in the buffer cache (the ARC, in ZFS's case). So, if ZFS is talking to
a stupid SSD, it knows that it cannot just issue a single block write
should a bit in the file change. Instead, ZFS will know that it should
issue a TRIM command (or something like it) to have the SSD mark the old
page (where the bit(s) change) deleted, then use the cached copy in the
ARC as the template to build a full page with the new bit(s) in it,
and then issue a full page write to the SSD. This avoids having the SSD
do the read-modify-write itself. The worst case is that the SSD will
have to read the whole page to get it back into the ARC, but the
typical case - where the page is already cached - is far more likely.
So, the typical case is 1 I/O, versus 3 on a "smart" SSD. My approach
uses more interface (SAS/SATA/etc.) bandwidth to the SSD, but that's
OK, since there's plenty to spare.
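The flow above can be sketched as code. The trim/write_page primitives and every name here are hypothetical stand-ins, not real ZFS interfaces; the point is simply that when the page is already cached in RAM, only the single full-page write reaches the device:

```python
PAGE = 4096  # bytes; illustrative page size (assumption)

class StupidSSD:
    """Dumb device model: counts the operations that actually hit media."""
    def __init__(self):
        self.live = {}        # page number -> contents
        self.device_ops = 0

    def trim(self, pno):
        # TRIM-like hint: mark the stale page deleted. Treated as a
        # cheap metadata operation, not a media read/write.
        self.live.pop(pno, None)

    def write_page(self, pno, data):
        assert len(data) == PAGE
        self.device_ops += 1
        self.live[pno] = data

def fs_modify(ssd, ram_cache, pno, off, val):
    """Typical case: the page is already cached, so no device read occurs."""
    buf = bytearray(ram_cache[pno])      # template comes from RAM, not the SSD
    buf[off] = val
    ram_cache[pno] = bytes(buf)
    ssd.trim(pno)                        # retire the old page
    ssd.write_page(pno, ram_cache[pno])  # one full-page write

ssd = StupidSSD()
cache = {7: bytes(PAGE)}                 # page already in the in-RAM cache
fs_modify(ssd, cache, 7, 0, 1)
print(ssd.device_ops)  # 1
```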
And, going back to the article where they're speculating having
something like RAID and Dedup integrated at the SSD controller-level,
this is just a /really/ bad idea.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss