[also agree with Keith's comments]

On Mar 3, 2014, at 2:35 PM, Nick <[email protected]> wrote:

> I've been testing write performance with SSDs and have a few observations and 
> questions.
> 
> My focus has been on testing random write performance with 4K blocks, which 
> is a commonly benchmarked item.

Do you actually see 4K random writes to the disk?  [bait placed, waiting for 
the hook to set]
Hint: iosnoop -Dast

> The problem with 4K blocks and SSDs is that all SSDs internally use a much 
> larger "erase block". It seems like it is commonly 256K to 2MB in size, 
> depending on the drive, and the way an SSD commonly functions is that if I 
> write 100 4K blocks it will keep them in its internal RAM until it gets 
> enough to fill an entire erase block, then it performs the actual write to 
> the flash memory. So whenever you see a benchmark online they are operating 
> the SSD in this fashion.

Nit: you should qualify this as "Flash SSDs" -- other SSDs can behave 
differently.

> 
> People running ZFS are generally concerned about data security, and this 
> is a problem: if the power dies, you lose any data that is in the SSD's RAM 
> but hasn't been written to flash.

Yes, very much so.

> 
> Most newer SSDs seem to honor a sync and actually flush their memory to 
> flash (although some older ones will lie about a sync).

All should, or they violate the spec.

> 
> The issue, when benchmarking 4K blocks for example, is that if I tell the SSD 
> to write 4K of data and then do a sync, it needs to read an entire erase 
> block (256K+), modify it, and write it back, so it ends up writing much more 
> data than it would have in async mode.
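
For scale, here is a rough sketch of the arithmetic behind that concern. It 
assumes the drive really does rewrite one whole erase block for every flushed 
4K write, which is the poster's model and not something established in this 
thread; real firmware may behave differently. The function name is made up 
for illustration.

```python
KIB = 1024

def write_amplification(erase_block_bytes, write_bytes=4 * KIB):
    """Bytes rewritten on flash per byte the host asked to write.

    Assumption: every synced write forces a full erase-block rewrite.
    """
    if erase_block_bytes % write_bytes != 0:
        raise ValueError("erase block must be a multiple of the write size")
    return erase_block_bytes // write_bytes

# The two erase-block sizes mentioned in the thread.
for label, size in (("256K", 256 * KIB), ("2M", 2048 * KIB)):
    print(f"{label} erase block: {write_amplification(size)}x amplification")
```

Under that assumption a 256K erase block costs 64x and a 2M erase block 
costs 512x.
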
> 
> The first SSD I tested was an Intel S3500 Enterprise SSD. This SSD has "power 
> loss protection" according to Intel, and has capacitors built in so that 
> sudden power loss would not cause you to lose data. In theory such a drive 
> should be able to ignore the sync calls and can safely buffer the data in 
> memory, since it is able to write it to flash in the event of a power loss.
> 
> I did some benchmarks of 4K random writes, where a sync was called after each 
> one (basically worst-case) and I found that the S3500 performed about twice 
> as fast when zfs_nocacheflush = 1, to disable ZFS's sending flush commands to 
> the drive.
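
A minimal sketch of the workload described above: 4K writes at random 
offsets with a sync after each one. This is not the poster's actual 
benchmark; the function name, file size, and write count are assumptions 
picked for illustration.

```python
import os
import random
import time

def sync_random_write_iops(path, file_bytes=16 * 1024 * 1024,
                           block=4096, writes=256, seed=0):
    """Write `writes` blocks at random aligned offsets, fsync after each."""
    rng = random.Random(seed)
    buf = os.urandom(block)
    blocks = file_bytes // block
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, file_bytes)       # size the test file up front
        start = time.perf_counter()
        for _ in range(writes):
            os.pwrite(fd, buf, rng.randrange(blocks) * block)
            os.fsync(fd)                   # force each write to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return writes / max(elapsed, 1e-9)     # synced writes per second
```

Run it against a file on the pool under test, once with cache flushes 
enabled and once with them disabled, and compare the two numbers. Keep in 
mind that this measures what the application sees; what actually reaches 
the disk may look quite different, hence the iosnoop hint earlier.
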

You can also disable the SYNCHRONIZE_CACHE on a per-disk-model basis.
This allows you to have both volatile and nonvolatile disk caches in the same 
system.
See sd(7d); the code also has a "cache-nonvolatile" option, which is 
unfortunately not documented in the man page :-(
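
As a sketch only, a per-model entry in /kernel/drv/sd.conf might look like 
the fragment below. The vendor and product strings are placeholders, not the 
real inquiry strings for any drive in this thread, and since the option is 
undocumented, check the name:value syntax against the sd source for your 
build before relying on it.

```
# Hypothetical entry: the vendor ID is padded to 8 characters, followed by
# the product ID, both exactly as the disk reports them.
sd-config-list = "VENDOR  PRODUCT-MODEL", "cache-nonvolatile:true";
```

The global alternative discussed above is "set zfs:zfs_nocacheflush = 1" in 
/etc/system, which applies to every disk in every pool on the system.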

> 
> So my first question is: Should I be operating the S3500 with 
> zfs_nocacheflush = 1? My theory was that this would do nothing for this 
> drive, as it should never need to be told to flush, but is it for some reason 
> forcing a flush even though it doesn't need to? My understanding for 
> zfs_nocacheflush is that this should be safe for this drive, since it has the 
> power loss protection. Am I incorrect in this?

Ask Intel.

> 
> Another interesting data point is that I switched the Intel S3500 out for a 
> Seagate 600 Pro drive. The "Pro" version of this drive is labeled an 
> "Enterprise" drive and advertises power loss protection, and teardowns 
> confirm a capacitor bank.
> 
> With zfs_nocacheflush = 1, this drive is very fast, even faster than the 
> Intel S3500. But when allowed to send the flush cache commands after every 4K 
> sync write, it becomes incredibly slow. Like 1/20th the speed or less, and 
> really not much faster than a conventional hard drive. Much slower than the 
> S3500 when both have zfs_nocacheflush = 0.

bummer. Ask Seagate.

> 
> Now, theoretically this drive should be able to have zfs_nocacheflush = 1 as 
> well, because it has the capacitor bank power loss protection.
> 
> I'm wondering if others have explored this issue and are running with 
> zfs_nocacheflush? It seems to me that this is a very overlooked part of SSDs 
> as every benchmark I can find assumes you are writing in async mode, which is 
> often not the case for databases. Thanks,

Yes. But we do it on a per-device basis. For example, for ZeusRAM, 
SYNCHRONIZE_CACHE is a NOP, so we're better off not bothering to send it.
 -- richard

--

[email protected]
+1-760-896-4422