Hi all,

If someone asked me, I would answer that verification is necessary, and that using a weaker, smaller hash is fine. Storing the block with its hash, marked "non-deduplicated", is fine; just dedup it later with a background process when the filesystem has some idle IOPS to spend on it, and mark it "deduplicated" when done. I have the feeling that the tux3 design is well suited to such a thing (multiple trees, just add one to the forest).
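To make that concrete, here is a rough sketch of the idea (all names and structures are invented for illustration, this is not tux3 code, and FNV-1a is just one example of a cheap hash): the write path stores the block immediately with a small hash and a "not deduplicated" flag, and the background pass only ever shares blocks after a full byte compare, so the weak hash can never silently corrupt anything.

    /* Toy sketch of "write now, dedup later" - all names invented,
     * nothing here is tux3 code. Blocks land on disk unverified and
     * unshared; a background pass dedups them when the disk is idle. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    enum dedup_state { BLOCK_NOT_DEDUPED, BLOCK_DEDUPED };

    struct block {
            uint32_t hash;          /* small, weak hash - collisions expected */
            enum dedup_state state;
            unsigned char data[BLOCK_SIZE];
    };

    /* Weak but fast 32-bit hash (FNV-1a); collisions are harmless
     * because we always verify the bytes before sharing a block. */
    static uint32_t weak_hash(const unsigned char *data, size_t len)
    {
            uint32_t h = 2166136261u;
            for (size_t i = 0; i < len; i++) {
                    h ^= data[i];
                    h *= 16777619u;
            }
            return h;
    }

    /* Write path: store the block immediately, marked not-yet-deduped. */
    static void write_block(struct block *blk, const unsigned char *data)
    {
            memcpy(blk->data, data, BLOCK_SIZE);
            blk->hash = weak_hash(data, BLOCK_SIZE);
            blk->state = BLOCK_NOT_DEDUPED;
    }

    /* Background pass: for each unverified block, find another block
     * with the same hash and byte-compare before sharing. In a real
     * filesystem the candidate would come from a hash index and
     * "sharing" would remap the extent; here we just report it. */
    static void dedup_pass(struct block *blocks, size_t count)
    {
            for (size_t i = 0; i < count; i++) {
                    if (blocks[i].state != BLOCK_NOT_DEDUPED)
                            continue;
                    for (size_t j = 0; j < i; j++) {
                            if (blocks[j].hash != blocks[i].hash)
                                    continue;
                            /* The crucial verify step: a hash match is
                             * only a hint, the data must really match. */
                            if (memcmp(blocks[j].data, blocks[i].data,
                                       BLOCK_SIZE) == 0) {
                                    printf("block %zu duplicates %zu\n", i, j);
                                    blocks[i].state = BLOCK_DEDUPED;
                                    break;
                            }
                    }
            }
    }

    int main(void)
    {
            static struct block blocks[3];
            unsigned char buf[BLOCK_SIZE] = { 0 };

            write_block(&blocks[0], buf);
            buf[0] = 1;
            write_block(&blocks[1], buf);
            buf[0] = 0;
            write_block(&blocks[2], buf);   /* same bytes as block 0 */

            dedup_pass(blocks, 3);          /* prints: block 2 duplicates 0 */
            return 0;
    }

The point of the split is that the expensive part (the byte compare) only costs idle IOPS, while the write path pays nothing but one cheap hash.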
By the way, may the Force be with you all, you're doing such a great job here! Let me also know if you need French localization; I'll spend my spare time on it!

Michael B.

On Wed, Feb 25, 2009 at 1:53 PM, Philipp Marek <philipp.ma...@emerion.com> wrote:
> On Mittwoch, 25. Februar 2009, Christensen Stefan wrote:
> > > -----Original Message-----
> > > From: Philipp Marek [mailto:philipp.ma...@emerion.com]
> > > Sent: Wednesday, February 25, 2009 12:31 PM
> > >
> > > That's the question ... if it's "cryptographically secure",
> > > it means (AFAIU) that it's "hard" to get collisions ... but
> > > it's not impossible.
> > > Really, it's *guaranteed* that on a large-enough filesystem
> > > (some TB, anyone?) you'll get two blocks with the same hash value.
> >
> > With a 512-bit hash value from SHA-2 (which is considered
> > collision-resistant), you'll probably get a collision roughly
> > after 2**256 blocks hashed (birthday paradox). This equates to
> > an extremely large filesystem (2**218 PiB). By using SHA-1 (which
> > has some problems, besides its limited size), you'll get a
> > collision after about 2**80 blocks. 2**80 blocks is still a very
> > large filesystem (2**42 PiB).
> Yes, I know. Thank you for calculating ;-)
>
> > > Therefore I asked whether the risk is acceptable ... there
> > > has been some filesystem (I think that was more than 10 years
> > > ago, didn't find a link) that tried deduplication by some
> > > hash - but got shot down, because without *verification* that
> > > the data is identical you might *silently* shoot yourself
> > > (and all others) in the foot.
> >
> > By using a large enough hash value there shouldn't be a problem.
> > But it might be a filesystem option.
> My point is - either there *is* verification (then the hash function
> itself doesn't matter that much), or there is *none*.
> In the latter case you risk trashing your data.
>
> As the amount of data stored will only grow, there's an increasing
> risk of collisions.
>
> And, if you use a 512-bit hash for 4096*8 bits of data, you have
> 1/64th of your storage wasted for the data index alone.
>
> > Ok.
> > > But if verification is needed anyway, then something *much*
> > > simpler (and *much* faster) would be ok, too.
> >
> > Any hash function that you'll use has a much shorter calculation
> > time than any access to rotating media. Even SSDs are slower than
> > what a mainstream CPU can calculate secure hashes from.
> The calculation itself, yes.
> But if you're getting 1MB of data, and have to tell some hardware to
> do 256 individual SHA2 calculations of 4kB each, you'll have some
> latency.
>
> If that's a simple calculation in the CPU, then you can already ask
> the SSD for the first (expected) data block after hashing the first
> 4kB.
>
> Maybe it's better via extra hardware - I don't know.
> I just think that
> - a *big* hash, for collision resistance, takes too much space; and
> - a smaller hash will probably have collisions in our lifetime.
> So take some ASIC or GPU, and use that for a *simple* hash
> calculation; but *verify* the block, to make sure that nothing bad
> happens.
>
> Regards,
>
> Phil
>
>
> _______________________________________________
> Tux3 mailing list
> Tux3@tux3.org
> http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
>
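PS: for anyone who wants to double-check the numbers quoted above, here is the back-of-the-envelope arithmetic (my own working, assuming 4 KiB = 2**12-byte blocks and the usual birthday bound):

\[ \text{expected collision after} \approx 2^{n/2} \text{ blocks, for an } n\text{-bit hash} \]
\[ n = 512:\quad 2^{256} \text{ blocks} \times 2^{12}\,\mathrm{B/block} = 2^{268}\,\mathrm{B} = 2^{218}\,\mathrm{PiB} \]
\[ n = 160\ (\text{SHA-1}):\quad 2^{80} \times 2^{12}\,\mathrm{B} = 2^{92}\,\mathrm{B} = 2^{42}\,\mathrm{PiB} \]
\[ \text{index overhead} = \frac{512}{4096 \times 8} = \frac{512}{32768} = \frac{1}{64} \]

(using 1 PiB = 2**50 B, so 2**268 B = 2**218 PiB)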