[also agree with Keith's comments]

On Mar 3, 2014, at 2:35 PM, Nick <[email protected]> wrote:

> I've been testing write performance with SSDs and have a few observations and
> questions.
>
> My focus has been on testing random write performance with 4K blocks, which
> is a commonly benchmarked item.

Do you actually see 4K random writes to the disk? [bait placed, waiting for the hook to set]
Hint: iosnoop -Dast

> The problem with 4K blocks and SSDs is that all SSDs internally use a much
> larger "erase block". It seems like it is commonly 256K to 2MB in size,
> depending on the drive, and the way an SSD commonly functions is that if I
> write 100 4K blocks it will keep them in its internal RAM until it gets
> enough to fill an entire erase block, then it performs the actual write to
> the flash memory. So whenever you see a benchmark online they are operating
> the SSD in this fashion.

Nit: you should qualify this as "Flash SSDs" -- other SSDs can behave differently.

>
> People running ZFS are generally concerned about data security, and this
> would be a problem as if the power dies you will lose any data that is in the
> SSD's RAM but hasn't been written to flash.

Yes, very much so.

>
> Most newer SSDs seem to honor a sync to actually flush its memory to flash
> (although some older ones will lie about a sync).

All should, or they violate the spec.

>
> The issue, when benchmarking 4K blocks for example, is that if I tell the SSD
> to write 4K of data and then do a sync, it needs to read an entire erase
> block (256K+) then modify it, and write it, so all of a sudden it has written
> much more data than it would have if written to in async mode.
>
> The first SSD I tested was an Intel S3500 Enterprise SSD. This SSD has "power
> loss protection" according to Intel, and has capacitors built in so that
> sudden power loss would not cause you to lose data. In theory such a drive
> should be able to ignore the sync calls and can safely buffer the data in
> memory, since it is able to write it to flash in the event of a power loss.
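
[Editorial sketch] The read-modify-write cost described in the quoted text can be put into numbers. This is a minimal sketch that takes the quoted model at face value: every synced write rewrites each erase block it touches in full, and writes are aligned. Real controllers may behave differently. The function name and the alignment assumption are illustrative and do not come from the thread.

```python
def write_amplification(io_size_bytes: int, erase_block_bytes: int) -> float:
    """Worst-case amplification under the assumed model: each synced write
    rewrites every erase block it touches in full (aligned writes assumed)."""
    if io_size_bytes <= 0 or erase_block_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: number of erase blocks an aligned write touches.
    blocks_touched = -(-io_size_bytes // erase_block_bytes)
    return blocks_touched * erase_block_bytes / io_size_bytes


KIB = 1024

if __name__ == "__main__":
    # The erase block sizes quoted in the message: 256K and 2MB.
    for erase_kib in (256, 2048):
        amp = write_amplification(4 * KIB, erase_kib * KIB)
        print(f"4K sync write, {erase_kib}K erase block: {amp:.0f}x")
```

Under these assumptions a 4K synced write costs 64x with a 256K erase block and 512x with a 2MB one, which is the gap between sync and async benchmark numbers that the message is pointing at.
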
>
> I did some benchmarks of 4K random writes, where a sync was called after each
> one (basically worst-case) and I found that the S3500 performed about twice
> as fast when zfs_nocacheflush = 1, to disable ZFS's sending flush commands to
> the drive.

You can also disable the SYNCHRONIZE_CACHE on a per-disk-model basis. This allows
you to have both volatile and nonvolatile disk caches in the same system. See sd(7d);
the code shows a "cache-nonvolatile" option, unfortunately not documented in the
man page :-(

>
> So my first question is: Should I be operating the S3500 with
> zfs_nocacheflush = 1? My theory was that this would do nothing for this
> drive, as it should never need to be told to flush, but is it for some reason
> forcing a flush even though it doesn't need to? My understanding for
> zfs_nocacheflush is that this should be safe for this drive, since it has the
> power loss protection. Am I incorrect in this?

Ask Intel.

>
> Another interesting data point is that I switched the Intel S3500 out for a
> Seagate 600 Pro drive. The "Pro" version of this drive is labeled an
> "Enterprise" drive and advertises power loss protection, and teardowns
> confirm a capacitor bank.
>
> With zfs_nocacheflush = 1, this drive is very fast, even faster than the
> Intel S3500. But when allowed to send the flush cache commands after every 4K
> sync write, it becomes incredibly slow. Like 1/20th the speed or less, and
> really not much faster than a conventional hard drive. Much slower than the
> S3500 when both have zfs_nocacheflush = 0.

bummer. Ask Seagate.

>
> Now, theoretically this drive should be able to have zfs_nocacheflush = 1 as
> well, because it has the capacitor bank power loss protection.
>
> I'm wondering if others have explored this issue and are running with
> zfs_nocacheflush? It seems to me that this is a very overlooked part of SSDs
> as every benchmark I can find assumes you are writing in async mode, which is
> often not the case for databases.
>
> Thanks,

Yes. But we do it on a per-device basis. For example, for ZeusRAM,
SYNCHRONIZE_CACHE is a NOP, so we're better off not bothering to send it.

 -- richard

--
[email protected]
+1-760-896-4422
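
[Editorial sketch] The benchmark described in the thread (4K random writes with a sync after each one, compared against the same writes without per-write syncs) could look roughly like the following. This is a hedged illustration and not the tool the original poster used: the function name, file size, and write count are assumptions, and it writes through the filesystem with `os.pwrite` and `os.fsync` on a Unix-like system. The I/O that reaches the disk may look different from what the application issues, which is what the `iosnoop` hint above suggests checking.

```python
import os
import random
import tempfile
import time


def bench_random_writes(path, file_size=4 * 1024 * 1024, block_size=4096,
                        count=200, sync_each=True, seed=0):
    """Return writes per second for `count` random, block-aligned writes.

    With sync_each=True, fsync() is called after every write (the worst case
    described in the thread); otherwise a single fsync() is issued at the end.
    """
    rng = random.Random(seed)
    payload = os.urandom(block_size)
    blocks = file_size // block_size
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, file_size)          # preallocate the logical size
        start = time.perf_counter()
        for _ in range(count):
            offset = rng.randrange(blocks) * block_size
            os.pwrite(fd, payload, offset)   # write one block at a random offset
            if sync_each:
                os.fsync(fd)                 # ask the OS to push it to stable storage
        if not sync_each:
            os.fsync(fd)                     # one flush at the end for the async case
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return count / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bench.bin")
        print("sync each write :", round(bench_random_writes(target, sync_each=True)), "writes/s")
        print("sync at the end :", round(bench_random_writes(target, sync_each=False)), "writes/s")
```

To compare settings as in the thread, the same run would be repeated on a file in the pool under test with cache flushes enabled and disabled; the temporary directory used here only exercises the code path.
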
