Bloom filters are a great fit for this :-)
On Jan 19, 2013, at 5:59 PM, Nico Williams <n...@cryptonector.com> wrote:
> I've wanted a system where dedup applies only to blocks being written
> that have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter
> (on disk) into which one inserts block hashes.
>
> To decide if a block needs dedup, one would first check the Bloom
> filter; if the block is in it, use the dedup code path, else the
> non-dedup code path, and insert the block in the Bloom filter. This
> means that the filesystem would store *two* copies of any
> deduplicatious block, with one of those not being in the DDT.
>
> This would allow most writes of non-duplicate blocks to be faster than
> normal dedup writes, but still slower than normal non-dedup writes:
> the Bloom filter will add some cost.
>
> The nice thing about this is that Bloom filters can be sized to fit in
> main memory, and will be much smaller than the DDT.
>
> It's very likely that this is a bit too obvious to just work.
>
> Of course, it is easier to just use flash. It's also easier to just
> not dedup: the most highly deduplicatious data (VM images) is
> relatively easy to manage using clones and snapshots, to a point
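
The write-path decision described above could be sketched roughly as follows. This is a hypothetical Python illustration, not ZFS code: `BloomFilter` here is a minimal in-memory filter (a real implementation would use a scalable, on-disk one), and `write_block` is my own name for the gating logic.

```python
import hashlib

class BloomFilter:
    """Minimal in-memory Bloom filter (illustration only; the proposal
    calls for a scalable filter persisted on disk)."""

    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _indexes(self, key: bytes):
        # Derive k bit positions from a SHA-256 digest of the key.
        d = hashlib.sha256(key).digest()
        for i in range(self.k):
            yield int.from_bytes(d[i * 4:(i + 1) * 4], "big") % self.m

    def add(self, key: bytes):
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, key: bytes):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(key))

def write_block(block: bytes, bloom: BloomFilter) -> str:
    """Return which code path a write would take under the proposed scheme."""
    h = hashlib.sha256(block).digest()  # stand-in for the block checksum
    if h in bloom:
        return "dedup"       # seen before (probably): go through the DDT
    bloom.add(h)
    return "non-dedup"       # first sighting: ordinary write, remember the hash
```

Note how this matches the two-copies property mentioned above: the first write of a block always takes the non-dedup path, so only the second and later writes of identical data land in the DDT.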
zfs-discuss mailing list