On 2013-01-20 19:55, Tomas Forsman wrote:
> On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:
>> While revising my home NAS which had dedup enabled before I gathered
>> that its RAM capacity was too puny for the task, I found that there is
>> some deduplication among the data bits I uploaded there (makes sense,
>> since it holds backups of many of the computers I've worked on - some
>> of my homedirs' contents were bound to intersect). However, a lot of
>> the blocks are in fact "unique" - have entries in the DDT with count=1
>> and the blkptr_t bit set. In fact they are not deduped, and with my
>> pouring of backups complete - they are unlikely to ever become deduped.
> Another RFE would be 'zfs dedup mypool/somefs' and basically go through
> and do a one-shot dedup. Would be useful in various scenarios. Possibly
> go through the entire pool at once, to make dedups intra-datasets (like
> "the real thing").
Yes, but that was asked before =)
Actually, the pool's metadata does contain all the needed bits (i.e.
checksum and size of blocks) such that a scrub-like procedure could
try to find identical blocks among the unique ones (perhaps filtered
to blocks referenced from a dataset that currently wants dedup),
throw one copy out and add a DDT entry for the other.
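To illustrate just the matching step, here is a minimal Python sketch.
It is not how ZFS or scrub work internally: the Block record, its fields
and the dedup_datasets filter are names I made up, and the only idea
taken from the above is that blocks with equal checksum and size are
candidates for merging.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical record for one block pointer (not a real ZFS structure)."""
    checksum: str   # block checksum as a hex string
    size: int       # block size in bytes
    dataset: str    # dataset that references the block
    addr: str       # opaque on-disk address, e.g. "vdev:offset"


def plan_merges(blocks, dedup_datasets):
    """Return (keep, drop) pairs for blocks that share checksum and size.

    Only blocks referenced from a dataset in dedup_datasets may be dropped.
    """
    groups = defaultdict(list)
    for b in blocks:
        groups[(b.checksum, b.size)].append(b)

    plan = []
    for same in groups.values():
        # Prefer to keep a copy whose dataset does not want dedup.
        keep = next((b for b in same if b.dataset not in dedup_datasets), same[0])
        for other in same:
            if other is not keep and other.dataset in dedup_datasets:
                plan.append((keep, other))
    return plan
```

Each (keep, drop) pair stands for "point the references of drop at keep
and account for it in the DDT"; that part is left out here since it is
exactly what would need support inside ZFS.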
On 2013-01-20 17:16, Edward Harvey wrote:
> So ... The way things presently are, ideally you would know in
> advance what stuff you were planning to write that has duplicate
> copies. You could enable dedup, then write all the stuff that's
> highly duplicated, then turn off dedup and write all the
> non-duplicate stuff. Obviously, however, this is a fairly
> implausible actual scenario.
Well, I guess I could script a solution that uses zdb to dump the
blockpointer tree (about 100 GB of text on my system), and some
perl or sort/uniq/grep parsing over this huge text to find blocks
that are the same but not deduped - as well as those single-copy
"deduped" ones, and toggle the dedup property while rewriting the
block inside its parent file with dd.
This would all be within current ZFS's capabilities and ultimately
reach the goals of deduping pre-existing data as well as dropping
unique blocks from the DDT. It would certainly not be a real-time
solution (it would likely take months on my box - just fetching the
BP tree took a couple of days) and would require more resources
than otherwise needed (rewrites of the same userdata, storing and
parsing of addresses as text instead of binary, etc.)
But I do see how this is doable even today, even by a non-expert ;)
(Not sure I'd ever get around to actually doing this, though -
it is not a very "clean" solution nor a performant one).
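For what it's worth, the parsing half could look roughly like the
Python sketch below. It does NOT read real zdb output: it assumes the
dump was already reduced to a made-up tab-separated format (checksum,
size, dedup bit as 0/1, file path, offset) and sorted so that lines
with the same checksum and size are adjacent, e.g. with sort(1). That
way only one group of lines is held in memory at a time, which matters
for a dump of this size.

```python
import itertools


def parse(line):
    """Split one line of the ASSUMED format: checksum, size, dedup bit, file, offset."""
    checksum, size, dedup_bit, path, offset = line.rstrip("\n").split("\t")
    return checksum, int(size), dedup_bit == "1", path, int(offset)


def classify(sorted_lines):
    """Yield (kind, records) for every group of blocks worth rewriting.

    sorted_lines must have equal (checksum, size) lines next to each other.
    """
    records = map(parse, sorted_lines)
    for _, group in itertools.groupby(records, key=lambda r: (r[0], r[1])):
        group = list(group)
        not_deduped = [r for r in group if not r[2]]
        if len(group) > 1 and not_deduped:
            # Same content referenced more than once, but not all via the DDT:
            # these copies would be rewritten with dedup=on.
            yield "dedup_candidates", not_deduped
        elif len(group) == 1 and group[0][2]:
            # Lone block with the dedup bit set (DDT count=1):
            # would be rewritten with dedup=off to drop the DDT entry.
            yield "single_copy_deduped", group
```

The rewriting itself (toggling the dedup property and running dd over
the right offset of each file) is not shown; the path and offset fields
are only carried along so that such a step would know where to go.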
As a bonus, however, this ZDB dump would also provide an answer
to a frequently-asked question: "which files on my system intersect
or are the same - and have some/all blocks in common via dedup?"
Knowledge of this answer might help admins with some policy
decisions, be it a witch-hunt for hoarders of the same files or some
pattern-making to determine which datasets should keep "dedup=on"...
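A rough Python sketch of how that question could be answered once the
dump is parsed: it takes (checksum, path) pairs, which is my own
simplification and not something zdb prints in that form, and counts
how many distinct block contents each pair of files has in common.

```python
from collections import defaultdict
from itertools import combinations


def shared_blocks(records):
    """Count blocks with equal checksums for each pair of files.

    records is an iterable of (checksum, path) tuples, an ASSUMED
    simplification of what would be parsed out of the dump. A checksum
    that repeats inside one file is counted once for that file.
    """
    files_by_checksum = defaultdict(set)
    for checksum, path in records:
        files_by_checksum[checksum].add(path)

    pair_counts = defaultdict(int)
    for paths in files_by_checksum.values():
        # Every pair of files holding this block shares one more block.
        for a, b in combinations(sorted(paths), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)
```

Comparing a pair's count with the number of blocks in each file would
tell "some blocks in common" apart from "the same file". For blocks
referenced from very many files the number of pairs grows quadratically,
so a real run would probably need a cap or a per-dataset summary.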
My few cents,
zfs-discuss mailing list