On Tue, May 11, 2010 at 04:15:24AM -0700, Bertrand Augereau wrote:
> Is there a O(nb_blocks_for_the_file) solution, then?
> 
> I know O(nb_blocks_for_the_file) == O(nb_bytes_in_the_file), from Mr. 
> Landau's POV, but I'm quite interested in a good constant factor.

If you were considering the hashes of each zfs block as a precomputed
value, it might be tempting to think of getting all of these and
hashing them together.  You could thereby avoid reading file data,
and the file metadata containing the hashes is something you'd have
needed to read anyway.  This would seem appealing, eliminating seeks
and cpu work.

However, there are some issues that make the approach basically
infeasible, and its results unreliable for comparing two otherwise
identical files.

First, you're assuming there's an easy interface to get the stored
hashes of a block, which there isn't.  Even if we ignore that for a
moment, the hashes zfs records depend on factors other than just the
file content, including the way the file has been written over time.  

The blocks of the file may not be constant size; a file that grew
slowly may have different hashes to a copy of it or one extracted
from an archive in a fast stream.  Filesystem properties, including
checksum (obvious), dedup (which implies checksum), compress (which
changes written data and can make holes), blocksize and maybe others
may be different between filesystems or even change over the time a
file has been written, and again change results and defeat
comparisons.
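
To illustrate the block size point, here is a toy model in Python.  It
is only a sketch: it does not read anything from zfs, sha256 stands in
for whatever checksum the filesystem uses, and hash_of_block_hashes
and the two block sizes are names and values I made up for the
example.  It shows that a hash over per-block hashes differs for the
same bytes as soon as the block boundaries differ.

```python
import hashlib


def hash_of_block_hashes(data: bytes, block_size: int) -> str:
    """Hash each fixed-size block, then hash the concatenated block digests."""
    outer = hashlib.sha256()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        outer.update(hashlib.sha256(block).digest())
    return outer.hexdigest()


# The same content, "stored" with two hypothetical block sizes.
content = b"identical file content " * 10000

small = hash_of_block_hashes(content, 4096)
large = hash_of_block_hashes(content, 131072)
whole = hashlib.sha256(content).hexdigest()

# The content is byte-for-byte identical, yet the combined hashes disagree
# with each other and with a plain hash of the whole content.
print(small == large)
print(small == whole)
```

Compression and variable block sizes within one file would only add
further ways for the two results to diverge.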

These things can defeat zfs's dedup too, even though it does have
access to the block level checksums.

If you're going to do an application-level dedup, you want to utilise
the advantage of being independent of these things - or even of the
underlying filesystem at all (e.g. dedup between two NAS shares).
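
As a rough sketch of what I mean by independent, assuming Python and
sha256 (both my choice for illustration, as are the names
content_digest and chunk_size): hashing the byte stream of the file
gives the same answer whatever the read size, and whatever filesystem
or share the file happens to live on.

```python
import hashlib


def content_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the sha256 of a file's bytes, read in chunk_size pieces."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```

Here chunk_size only affects how much is read per call, not the
result, which is exactly the property the block-level hashes lack.
The cost is that every byte of the file has to be read.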

Something similar would be useful, and much more readily achievable,
from ZFS for such an application, and many others.  Rather than a way
to compare reliably between two files for identity, I'd like a way to
compare identity of a single file between two points in time.  If my
application can tell quickly that the file content is unaltered since
last time I saw the file, I can avoid rehashing the content and use a
stored value. If I can achieve this result for a whole directory
tree, even better.
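
For comparison, the nearest an application can get today without help
from the filesystem is something like the sketch below.  It assumes
that an unchanged (device, inode, size, mtime) tuple means unchanged
content, which is only a heuristic and can be fooled; stat_signature,
cached_digest and the in-memory dict used as a cache are hypothetical
names for the example, not anything zfs provides.

```python
import hashlib
import os


def stat_signature(path):
    """Cheap fingerprint of a file's identity and last modification."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)


def cached_digest(path, cache):
    """Return the sha256 of path, rehashing only if its signature changed."""
    sig = stat_signature(path)
    entry = cache.get(path)
    if entry is not None and entry[0] == sig:
        # Signature unchanged since last time: trust the stored value.
        return entry[1]
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    cache[path] = (sig, digest)
    return digest
```

A reliable "unchanged since time T" answer from the filesystem would
replace the stat heuristic with something trustworthy.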

--
Dan.



_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
