Eric D. Mudama wrote:
> On Fri, Jan  1 at 21:21, Erik Trimble wrote:
>> That all said, it certainly would be really nice to get an SSD controller which can really push the bandwidth, and the only way I see this happening now is to go the "stupid" route, and dumb down the controller as much as possible. I really think we just want the controller to Do What I Say, and not try any optimizations or such. There's simply much more benefit to doing the optimization up at the filesystem level than down at the device level. For a trivial case, consider the dreaded "read-modify-write" problem of MLCs: to write a single bit, a whole page has to be read, then the page recomposed with the changed bits, before writing again. If the filesystem were aware that the drive had this kind of issue, then in-RAM caching would almost always allow the first "read" cycle to be avoided, and performance goes back to a typical Copy-on-Write style stripe write.
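
To make the two write paths concrete, here is a minimal sketch in C. The page size, the in-memory "flash", and the function names are all invented for illustration; this is not ZFS code or anyone's firmware, just the shape of the problem:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES    16

static uint8_t flash[NPAGES][PAGE_SIZE];   /* stand-in for the NAND array   */
static int media_reads, media_writes;      /* count operations on the media */

static void nand_read(int p, uint8_t *buf)
{
        memcpy(buf, flash[p], PAGE_SIZE);
        media_reads++;
}

static void nand_write(int p, const uint8_t *buf)
{
        memcpy(flash[p], buf, PAGE_SIZE);
        media_writes++;
}

/*
 * "Smart" controller path: only the changed bytes arrive over the host
 * interface, so the controller must read the page off the media before
 * it can recompose and rewrite it.
 */
static void controller_rmw(int p, size_t off, const uint8_t *delta, size_t len)
{
        uint8_t buf[PAGE_SIZE];

        nand_read(p, buf);              /* the read we want to avoid */
        memcpy(buf + off, delta, len);
        nand_write(p, buf);
}

/*
 * Filesystem path: the page image is usually already sitting in the
 * in-RAM cache, so only the program operation touches the media.
 */
static void fs_cached_write(int p, size_t off, const uint8_t *delta,
    size_t len, uint8_t *cached_page)
{
        memcpy(cached_page + off, delta, len);
        nand_write(p, cached_page);
}

int main(void)
{
        uint8_t cached[PAGE_SIZE] = { 0 };
        uint8_t newbit = 0xff;

        controller_rmw(3, 100, &newbit, 1);
        fs_cached_write(4, 100, &newbit, 1, cached);
        printf("media reads: %d, media writes: %d\n",
            media_reads, media_writes);
        return 0;
}

Running it reports one media read and two media writes: the single read is exactly the one a cache-aware filesystem skips in the common case.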

> I am not convinced that a general purpose CPU, running other software
> in parallel, will be able to be timely and responsive enough to
> maximize bandwidth in an SSD controller without specialized hardware
> support.  This hardware support is, of course, the controller that
> exists on modern SSDs.
Why not? My argument is the one that ZFS as a whole is founded on: that modern CPUs have so many spare cycles that it's silly to pay extra for a smart RAID controller when we can just borrow time on the main CPU. It seems to work out just fine for hard drives, so why not for SSDs (which, while much faster than hard drives, are still many orders of magnitude slower than DMA transfers)?

> Drive vendors abstracted these interfaces a long time ago, creating
> Integrated Drive Electronics (IDE).  Bringing all of that logic back
> up into the CPU would likely not help meaningfully.  Yes, it would
> likely be cheaper, but I doubt it would be faster or more reliable.

I'm not advocating a return to something like the old IPI technology (oh boy, did I just date myself there...). That's silly. By "dumb", I mean something on the level of IDE - a disk controller that handles nothing more than internal (to the disk) bad block remapping, LBA-to-actual-block mapping, etc. In the case of a "stupid" SSD controller, that would entail sufficient smarts to do wear leveling, LBA-to-physical mapping, bad page/block detection and marking, and very little else.
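
To give a rough idea of how little that really is, here is the sort of bookkeeping such a "dumb" controller would carry. The structure and names are invented for illustration, not taken from any real firmware:

#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS       1024
#define PAGES_PER_BLK 128
#define NPAGES        (NBLOCKS * PAGES_PER_BLK)

/* Everything a "dumb" SSD controller would track, and nothing more. */
struct dumb_ftl {
        uint32_t lba_to_phys[NPAGES];   /* logical page -> physical page */
        uint32_t erase_count[NBLOCKS];  /* wear-leveling statistics      */
        bool     bad_block[NBLOCKS];    /* factory/runtime bad blocks    */
};

/* Wear leveling: pick the least-worn block that isn't marked bad. */
uint32_t pick_target_block(const struct dumb_ftl *f)
{
        uint32_t best = 0, lowest = UINT32_MAX;

        for (uint32_t b = 0; b < NBLOCKS; b++) {
                if (!f->bad_block[b] && f->erase_count[b] < lowest) {
                        best = b;
                        lowest = f->erase_count[b];
                }
        }
        return best;
}

/* Remap one logical page to a freshly written physical page.  No RMW,
 * no dedup, no RAID -- those decisions stay up in the filesystem. */
void remap_page(struct dumb_ftl *f, uint32_t lba, uint32_t phys)
{
        f->lba_to_phys[lba] = phys;
}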


> I also am not convinced that your described RMW semantics are used in
> any modern NAND devices.  Those problems were solved years ago.  The
> granularity of the implementation has implications on performance in
> some workloads, but I believe only those old JMicron-based SSDs did
> block-level RMW, and hence wound up doing about ~2-3 IOPS in random
> workloads with MLC drives.
ALL modern MLC-based SSDs have exactly the problem I described in the example above. It's a defining characteristic of the Multi-Level Cell design. Even a nice modern Intel X25-M can lose 50-80% of its theoretical maximum write performance once it runs out of unused cells to write to. And that's with the fancy firmware.

> SSDs (with good controllers) really strut their stuff when in-RAM
> caching isn't working anyway.  If in-RAM was good enough, then why
> bother with SSD?  Just have a spun-down rotating drive at 1/5th-1/15th
> the cost.
>
> --eric
No, they don't (strut, that is). MLC-based SSDs (and even SLC-based ones, to a lesser extent) have a very significant write penalty. Much of the "smarts" that goes into current-gen SSDs is an attempt to overcome this design limitation. What Bob and I are saying is that locating the "smarts" in the SSD controller is misguided. Having this intelligence located in the OS/filesystem driver is a far better idea, as the system has a much more global view of where optimizations can occur, and can make the appropriate choices. And, frankly, it's far easier to update a filesystem driver than it is to reflash firmware on an SSD, should any changes be necessary.

The example I was giving for R-M-W is that it is /highly/ likely that the OS already has the relevant chunk of the file being modified sitting in the buffer cache (the ARC, in ZFS's case). So, if ZFS is talking to a stupid SSD, it knows that it cannot just issue a single small write when a bit in the file changes. Instead, ZFS knows it should issue a TRIM command (or something like it) to have the SSD mark the old page (where the bit(s) changed) as deleted, use the cached copy of that page as the template to build a full page containing the new bit(s), and then issue a full-page write to the SSD. This avoids the SSD having to do the read-modify-write itself. Worst case, the SSD has to read the whole page once to get it into the cache, but in the typical case the data is already there. So the typical case is 1 I/O, versus 3 on a "smart" SSD. My approach uses more interface (SAS/SATA/etc.) bandwidth to the SSD, but that's OK, since there's plenty to spare.

And, going back to the article speculating about integrating something like RAID and dedup at the SSD controller level: that is just a /really/ bad idea.
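
In rough sketch form, the flow I'm describing looks something like this. The names (update_page, ssd_trim, and so on) and the toy in-memory device are invented just to keep the example self-contained; they are not ZFS or SATA interfaces:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NPAGES    64

/* Toy device and cache, just to keep the sketch self-contained. */
static uint8_t ssd[NPAGES][PAGE_SIZE];      /* pretend flash pages            */
static uint8_t arc[NPAGES][PAGE_SIZE];      /* pretend in-RAM cache (ARC)     */
static int     arc_valid[NPAGES];
static int     page_transfers;              /* full pages moved over SAS/SATA */

static void ssd_read_page(int lba, uint8_t *buf)
{
        memcpy(buf, ssd[lba], PAGE_SIZE);
        page_transfers++;
}

static void ssd_write_page(int lba, const uint8_t *buf)
{
        memcpy(ssd[lba], buf, PAGE_SIZE);
        page_transfers++;
}

static void ssd_trim(int lba)
{
        (void)lba;                           /* metadata only, no data moved */
}

/*
 * Change a few bytes within one page, copy-on-write style.  Typical case:
 * the page is already in the ARC, so the device sees a TRIM plus one
 * full-page write and never has to read.  Worst case: one extra read to
 * pull the page into the cache first.
 */
static void update_page(int old_lba, int new_lba, size_t off,
    const uint8_t *delta, size_t len)
{
        if (!arc_valid[old_lba]) {                   /* worst case: cache miss */
                ssd_read_page(old_lba, arc[old_lba]);
                arc_valid[old_lba] = 1;
        }
        memcpy(arc[old_lba] + off, delta, len);      /* rebuild page in host RAM */
        ssd_trim(old_lba);                           /* retire the old page      */
        ssd_write_page(new_lba, arc[old_lba]);       /* one full-page write      */
}

int main(void)
{
        uint8_t delta = 0xff;

        arc_valid[7] = 1;                    /* typical case: page is cached */
        update_page(7, 8, 42, &delta, 1);
        printf("typical case: %d full-page transfers\n", page_transfers);

        page_transfers = 0;
        update_page(20, 21, 42, &delta, 1);  /* worst case: page not cached  */
        printf("worst case:   %d full-page transfers\n", page_transfers);
        return 0;
}

The output shows one full-page transfer in the typical (cached) case and two in the worst case; the TRIM carries no data, so it isn't counted.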
--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

