I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter.  This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.

This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash.  It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point

zfs-discuss mailing list

Reply via email to