Hi all,

If someone asked me, I would answer that verification is necessary, and that using a weaker, smaller hash is fine. Storing the block with its hash, marked "non-deduplicated", is fine; just dedup it later with a background process when the filesystem has some idle IOPS to spend on it, and mark it "deduplicated" when done. I have the feeling that the tux3 design is well suited to such a thing (multiple trees, just add one to the forest).
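To make that concrete, here is a rough sketch of the idea (all names and structures are invented for illustration, this is not tux3 code, and FNV-1a is just one example of a cheap hash): the write path stores the block immediately with a small hash and a "not deduplicated" flag, and the background pass only ever shares blocks after a full byte compare, so the weak hash can never silently corrupt anything.

    /* Toy sketch of "write now, dedup later" - all names invented,
     * nothing here is tux3 code. Blocks land on disk unverified and
     * unshared; a background pass dedups them when the disk is idle. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    enum dedup_state { BLOCK_NOT_DEDUPED, BLOCK_DEDUPED };

    struct block {
            uint32_t hash;          /* small, weak hash - collisions expected */
            enum dedup_state state;
            unsigned char data[BLOCK_SIZE];
    };

    /* Weak but fast 32-bit hash (FNV-1a); collisions are harmless
     * because we always verify the bytes before sharing a block. */
    static uint32_t weak_hash(const unsigned char *data, size_t len)
    {
            uint32_t h = 2166136261u;
            for (size_t i = 0; i < len; i++) {
                    h ^= data[i];
                    h *= 16777619u;
            }
            return h;
    }

    /* Write path: store the block immediately, marked not-yet-deduped. */
    static void write_block(struct block *blk, const unsigned char *data)
    {
            memcpy(blk->data, data, BLOCK_SIZE);
            blk->hash = weak_hash(data, BLOCK_SIZE);
            blk->state = BLOCK_NOT_DEDUPED;
    }

    /* Background pass: for each unverified block, find another block
     * with the same hash and byte-compare before sharing. In a real
     * filesystem the candidate would come from a hash index and
     * "sharing" would remap the extent; here we just report it. */
    static void dedup_pass(struct block *blocks, size_t count)
    {
            for (size_t i = 0; i < count; i++) {
                    if (blocks[i].state != BLOCK_NOT_DEDUPED)
                            continue;
                    for (size_t j = 0; j < i; j++) {
                            if (blocks[j].hash != blocks[i].hash)
                                    continue;
                            /* The crucial verify step: a hash match is
                             * only a hint, the data must really match. */
                            if (memcmp(blocks[j].data, blocks[i].data,
                                       BLOCK_SIZE) == 0) {
                                    printf("block %zu duplicates %zu\n", i, j);
                                    blocks[i].state = BLOCK_DEDUPED;
                                    break;
                            }
                    }
            }
    }

    int main(void)
    {
            static struct block blocks[3];
            unsigned char buf[BLOCK_SIZE] = { 0 };

            write_block(&blocks[0], buf);
            buf[0] = 1;
            write_block(&blocks[1], buf);
            buf[0] = 0;
            write_block(&blocks[2], buf);   /* same bytes as block 0 */

            dedup_pass(blocks, 3);          /* prints: block 2 duplicates 0 */
            return 0;
    }

The point of the split is that the expensive part (the byte compare) only costs idle IOPS, while the write path pays nothing but one cheap hash.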
By the way, may the Force be with you all, you're doing such a great job here! Let me also know if you need French localization; I'll spend my spare time on it!

Michael B.

On Wed, Feb 25, 2009 at 1:53 PM, Philipp Marek <philipp.ma...@emerion.com> wrote:
> On Mittwoch, 25. Februar 2009, Christensen Stefan wrote:
> > > -----Original Message-----
> > > From: Philipp Marek [mailto:philipp.ma...@emerion.com]
> > > Sent: Wednesday, February 25, 2009 12:31 PM
> > >
> > > That's the question ... if it's "cryptographically secure",
> > > it means (AFAIU) that it's "hard" to get collisions ... but
> > > it's not impossible.
> > > Really, it's *guaranteed* that on a large-enough filesystem
> > > (some TB, anyone?) you'll get two blocks with the same hash value.
> >
> > With a 512-bit hash value from SHA-2 (which is considered
> > collision-resistant), you'll probably get a collision roughly
> > after 2**256 blocks hashed (birthday paradox). This equates to
> > an extremely large filesystem (2**218 PiB). By using SHA-1 (which
> > has some problems, besides its limited size), you'll get a
> > collision after about 2**80 blocks. 2**80 blocks is still a very
> > large filesystem (2**42 PiB).
> Yes, I know. Thank you for calculating ;-)
>
> > > Therefore I asked whether the risk is acceptable ... there
> > > has been some filesystem (I think that was more than 10 years
> > > ago, didn't find a link) that tried deduplication by some
> > > hash - but got shot down, because without *verification* that
> > > the data is identical you might *silently* shoot yourself
> > > (and all others) in the foot.
> >
> > By using a large enough hash value there shouldn't be a problem.
> > But it might be a filesystem option.
> My point is - either there *is* verification (then the hash function
> itself doesn't matter that much), or there is *none*.
> In the latter case you risk trashing your data.
>
> As the amount of data stored will only grow, there's an increasing
> risk of collisions.
>
> And, if you use a 512-bit hash for 4096*8 bits of data, you have
> 1/64th of your storage wasted for the data index alone.
>
> > Ok.
> > > But if verification is needed anyway, then something *much*
> > > simpler (and *much* faster) would be ok, too.
> >
> > Any hash function that you'll use has a much shorter calculation
> > time than any access to rotating media. Even SSDs are slower than
> > what a mainstream CPU can calculate secure hashes from.
> The calculation itself, yes.
> But if you're getting 1MB of data, and have to tell some hardware to
> do 256 individual SHA2 calculations of 4kB each, you'll have some
> latency.
>
> If that's a simple calculation in the CPU, then you can already ask
> the SSD for the first (expected) data block after hashing the first
> 4kB.
>
> Maybe it's better via extra hardware - I don't know.
> I just think that
> - a *big* hash, for collision resistance, takes too much space; and
> - a smaller hash will probably have collisions in our lifetime.
> So take some ASIC or GPU, and use that for a *simple* hash
> calculation; but *verify* the block, to make sure that nothing bad
> happens.
>
> Regards,
>
> Phil
>
>
> _______________________________________________
> Tux3 mailing list
> Tux3@tux3.org
> http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
>
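PS: for anyone who wants to double-check the numbers quoted above, here is the back-of-the-envelope arithmetic (my own working, assuming 4 KiB = 2**12-byte blocks and the usual birthday bound):

\[ \text{expected collision after} \approx 2^{n/2} \text{ blocks, for an } n\text{-bit hash} \]
\[ n = 512:\quad 2^{256} \text{ blocks} \times 2^{12}\,\mathrm{B/block} = 2^{268}\,\mathrm{B} = 2^{218}\,\mathrm{PiB} \]
\[ n = 160\ (\text{SHA-1}):\quad 2^{80} \times 2^{12}\,\mathrm{B} = 2^{92}\,\mathrm{B} = 2^{42}\,\mathrm{PiB} \]
\[ \text{index overhead} = \frac{512}{4096 \times 8} = \frac{512}{32768} = \frac{1}{64} \]

(using 1 PiB = 2**50 B, so 2**268 B = 2**218 PiB)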