On 21 Jan 2010, at 22:55, Daniel Carosone wrote:

> On Thu, Jan 21, 2010 at 05:04:51PM +0100, erik.ableson wrote:
> 
>> What I'm trying to get a handle on is how to estimate the memory
>> overhead required for dedup on that amount of storage.   
> 
> We'd all appreciate better visibility of this. This requires:
> - time and observation and experience, and
> - better observability tools and (probably) data exposed for them

I'd guess that since every block written is going to trigger a lookup against the hash 
keys, that data should end up living in the ARC under the MFU ruleset.  The theory is 
that if I can determine the maximum memory required to hold those keys, I'll know the 
minimum memory baseline needed to guarantee that I won't be caught short.
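
As a rough illustration of that baseline, here's a minimal sketch in Python, assuming 
one dedup-table entry per unique block and an in-core cost on the order of ~320 bytes 
per entry (a commonly quoted ballpark, not an authoritative figure; the data size and 
block size below are hypothetical):

# Minimal sketch of the memory-baseline idea (assumptions, not measurements):
# one DDT entry per unique block, ~320 bytes of ARC per entry.  Verify the
# per-entry cost against zdb -D output on a real pool before trusting it.
BYTES_PER_DDT_ENTRY = 320

def ddt_memory_gb(data_bytes, avg_block_bytes):
    """RAM needed to keep the whole dedup table resident in ARC."""
    entries = data_bytes / avg_block_bytes      # one entry per unique block
    return entries * BYTES_PER_DDT_ENTRY / 2.0**30

# Hypothetical example: 10 TB of unique data written in 64k blocks.
print("%.1f GB" % ddt_memory_gb(10 * 2.0**40, 64 * 2**10))   # roughly 50 GB

Whether the real per-entry cost is 250 or 370 bytes moves the answer linearly, which is 
exactly why the better observability you mention would help.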

>> So the question is how much memory or L2ARC would be necessary to
>> ensure that I'm never going back to disk to read out the hash keys. 
> 
> I think that's a wrong-goal for optimisation.
> 
> For performance (rather than space) issues, I look at dedup as simply
> increasing the size of the working set, with a goal of reducing the
> amount of IO (avoided duplicate writes) in return.

True, but as a practical matter, we've seen that overall performance drops off a cliff 
if you overstep your memory bounds and the system is obliged to go to disk to check a 
new block to be written against the hash keys. This is compounded by the fact that the 
ARC is full, so it has to go straight to disk, further exacerbating the problem.

It's this particular scenario that I'm trying to avoid, and from the business aspect of 
selling ZFS-based solutions (whether to a client or to an internal project), we need to 
be able to ensure that performance is predictable, with no surprises.

I realize, of course, that all of this depends on a slew of uncontrollable variables 
(size of the working set, IO profiles, ideal block sizes, etc.).  Still, the empirical 
approach of "give it lots and we'll see if we need to add an L2ARC later" is not really 
viable for many managers (despite the fact that the real world works like this).

> The trouble is that the hash function produces (we can assume) random
> hits across the DDT, so the working set depends on the amount of
> data and the rate of potentially dedupable writes as well as the
> actual dedup hit ratio.  A high rate of writes also means a large
> amount of data in ARC waiting to be written at the same time. This
> makes analysis very hard (and pushes you very fast towards that very
> steep cliff, as we've all seen). 

I don't think it would be random, since _any_ write operation on a deduplicated 
filesystem requires a hash check, forcing the keys to live in the MFU.  However, I 
agree that a high write rate would put memory pressure on the ARC, which could result 
in the eviction of the hash keys. So the next factor to include in memory sizing is the 
maximum write rate (determined by IO availability). With a team of two GbE cards, I 
could conservatively say that I need to size for inbound write IO of 160MB/s, worst 
case accumulated over the 30-second flush cycle, so about 5GB of memory (leaving aside 
ZIL issues etc.). Note that these are very back-of-the-napkin estimates, and I also 
need some idea of what my physical storage is capable of ingesting, which could add to 
this value.
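
For what it's worth, here's that napkin arithmetic written out, using the same assumed 
numbers as above (160MB/s inbound, 30-second flush cycle); none of these values are 
measured:

# Dirty data that can accumulate in ARC between transaction-group flushes
# at a sustained inbound write rate.  Both inputs are assumptions from the
# paragraph above, not measurements.
inbound_mb_per_s = 160        # conservative ceiling for a two-port GbE team
txg_interval_s = 30           # flush interval assumed in the estimate

dirty_mb = inbound_mb_per_s * txg_interval_s
print("worst-case dirty data per flush: %.1f GB" % (dirty_mb / 1024.0))  # ~4.7 GB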

> I also think a threshold on the size of blocks to try deduping would
> help.  If I only dedup blocks (say) 64k and larger, i might well get
> most of the space benefit for much less overhead.

Well - since my primary use case is iSCSI presentation to VMware backed by zvols, and I 
can manually force the block size to 64k at volume creation, this reduces the 
unpredictability a little bit. That's based on the hypothesis that zvols use a fixed 
block size.
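
To show why pinning the block size helps predictability, here's a small sketch 
comparing the per-TB dedup-table footprint at a few block sizes (same assumed ~320 
bytes per entry as in the earlier sketch):

# With a fixed block size, the number of DDT entries per TB is bounded, so
# the memory ceiling is too.  The per-entry cost is an assumption carried
# over from the earlier sketch, not a measured value.
BYTES_PER_DDT_ENTRY = 320

def ddt_gb_per_tb(volblocksize):
    entries_per_tb = 2**40 // volblocksize
    return entries_per_tb * BYTES_PER_DDT_ENTRY / 2.0**30

for bs in (8 * 2**10, 64 * 2**10, 128 * 2**10):
    print("%4dk blocks -> %5.1f GB of DDT per TB" % (bs // 1024, ddt_gb_per_tb(bs)))

Which also lines up with your point about a size threshold: raising the minimum block 
size you bother to dedup shrinks the table roughly in proportion.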