[PHP] Which hashing algorithm is best to check file duplicity?

2009-03-15 Thread Martin Zvarík
I want to store the file's hash to the database, so I can check next 
time to see if that file was already uploaded (even if it was renamed).


What would be the best (= fastest + small chance of collision) algorithm 
in this case?


Is crc32 a good choice?

Thank you in advance,
Martin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Which hashing algorithm is best to check file duplicity?

2009-03-15 Thread Jan G.B.
2009/3/15 Martin Zvarík mzva...@gmail.com:
> I want to store the file's hash to the database, so I can check next time to
> see if that file was already uploaded (even if it was renamed).
>
> What would be the best (= fastest + small chance of collision) algorithm in
> this case?
>
> Is crc32 a good choice?

I guess not.
Maybe unhex(md5()) into a binary(16) field?
What I'm trying to say is that crc32 is more likely to produce
collisions than a better algorithm such as sha1 or md5,
and that the data type of the database column should be considered.
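
To put rough numbers on "more likely to have collisions", here is a minimal sketch of the birthday approximation. It assumes digests are uniformly distributed and ignores deliberately crafted files; the function name `collision_probability` and the figure of 100,000 files are made up for illustration and do not come from this thread.

```php
<?php
// Approximate probability that at least two of $files distinct files share
// a digest, assuming digests are spread uniformly over 2^$bits values.
// Birthday approximation: p ~ 1 - exp(-n(n-1) / 2 / 2^bits).
function collision_probability(int $files, int $bits): float
{
    $pairs = $files * ($files - 1) / 2;   // number of file pairs
    $exponent = $pairs / (2 ** $bits);    // 2 ** $bits becomes a float when large
    return -expm1(-$exponent);            // expm1() keeps precision for tiny values
}

$files = 100000; // hypothetical number of stored uploads
foreach (['crc32' => 32, 'md5' => 128, 'sha1' => 160] as $algo => $bits) {
    printf("%-6s %3d bits  p = %.3e\n", $algo, $bits, collision_probability($files, $bits));
}
```

Under these assumptions a 32-bit checksum is more likely than not to collide somewhere among 100,000 files, while the 128-bit and 160-bit digests stay negligibly small.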

byebye




Re: [PHP] Which hashing algorithm is best to check file duplicity?

2009-03-15 Thread Paul M Foster
On Sun, Mar 15, 2009 at 10:25:11PM +0100, Martin Zvarík wrote:

> I want to store the file's hash to the database, so I can check next
> time to see if that file was already uploaded (even if it was renamed).
>
> What would be the best (= fastest + small chance of collision) algorithm
> in this case?
>
> Is crc32 a good choice?
>
> Thank you in advance,
> Martin

According to Wikipedia, a CRC is not sufficient to detect intentional
alteration of a file/message, since it's relatively easy to design a
message/file which will have the same CRC. On the other hand, a CRC will,
by design, reliably detect subtle changes to a message/file.
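
As a small illustration of the "subtle changes" half of that statement, the sketch below uses PHP's hash() with the 'crc32b' algorithm on two strings that differ in one letter. The sample strings are arbitrary. A CRC-32 is guaranteed to change when the alteration is confined to 32 consecutive bits, which a single changed byte is; nothing here says anything about crafted collisions.

```php
<?php
// Two inputs of equal length that differ in a single character.
$original = 'The quick brown fox jumps over the lazy dog';
$altered  = 'The quick brown fox jumps over the lazy cog';

$crcOriginal = hash('crc32b', $original); // 8 hex characters = 32 bits
$crcAltered  = hash('crc32b', $altered);

echo "original: $crcOriginal\n";
echo "altered:  $crcAltered\n";
echo $crcOriginal !== $crcAltered ? "checksums differ\n" : "checksums match\n";
```

The checksum is only 32 bits wide, so it catches small accidental changes well but leaves few possible values to spread many unrelated files over.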

Paul

-- 
Paul M. Foster




Re: [PHP] Which hashing algorithm is best to check file duplicity?

2009-03-15 Thread Chris

Martin Zvarík wrote:
> I want to store the file's hash to the database, so I can check next
> time to see if that file was already uploaded (even if it was renamed).
>
> What would be the best (= fastest + small chance of collision) algorithm
> in this case?


Fastest depends mostly on the size of the file, not the algorithm
used. A 2 GB file will take a while with md5, just as it will with sha1.

md5 will be somewhat quicker than sha1: it does less work per block and
produces a shorter hash (128 bits versus 160), so the trade-off is up to you.


$ ls -lh file.gz
724M 2008-07-28 10:02 file.gz

$ time sha1sum file.gz
4ae7bd1e79088a3e3849e17c7be989d4a7c97450  file.gz

real    0m3.398s
user    0m3.056s
sys     0m0.336s

$ time md5sum file.gz
16cff7b95bcb5971daf1cabee6ca4edd  file.gz

real    0m2.091s
user    0m1.744s
sys     0m0.328s

$ time sha1sum file.gz
4ae7bd1e79088a3e3849e17c7be989d4a7c97450  file.gz

real    0m3.332s
user    0m2.988s
sys     0m0.344s

$ time md5sum file.gz
16cff7b95bcb5971daf1cabee6ca4edd  file.gz

real    0m2.136s
user    0m1.776s
sys     0m0.348s
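
The same comparison can be sketched from inside PHP with hash_file(), which streams the file instead of loading it into memory. This is only a rough sketch: the helper name `time_hash` and the 8 MiB random test file are invented for illustration, hrtime() needs PHP 7.3 or later, and the timings will vary with the machine and the disk cache.

```php
<?php
// Hypothetical helper: time a single hash_file() call for one algorithm.
function time_hash(string $algo, string $path): array
{
    $start = hrtime(true);                        // monotonic clock, nanoseconds
    $digest = hash_file($algo, $path);            // hex digest, read in chunks
    $elapsedMs = (hrtime(true) - $start) / 1e6;   // convert to milliseconds
    return ['algo' => $algo, 'digest' => $digest, 'ms' => $elapsedMs];
}

// Throwaway test file; the size is arbitrary.
$path = tempnam(sys_get_temp_dir(), 'hash');
file_put_contents($path, random_bytes(8 * 1024 * 1024));

foreach (['crc32b', 'md5', 'sha1'] as $algo) {
    $result = time_hash($algo, $path);
    printf("%-7s %8.2f ms  %s\n", $result['algo'], $result['ms'], $result['digest']);
}

unlink($path);
```

Running each algorithm more than once, as in the shell session above, helps separate hashing cost from the first read of the file from disk.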

--
Postgresql & php tutorials
http://www.designmagick.com/





Re: [PHP] Which hashing algorithm is best to check file duplicity?

2009-03-15 Thread Martin Zvarík


Aha, thanks for sharing the benchmark. I'll go with MD5()
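
For what the resulting duplicate check might look like, here is a minimal sketch that combines md5_file() with the binary(16) idea from earlier in the thread. The class `UploadIndex` is hypothetical and keeps digests in an in-memory array as a stand-in for a database table; a real application would query its own table instead. It assumes PHP 7.4 or later for the typed property.

```php
<?php
// Hypothetical in-memory stand-in for a table with a BINARY(16) digest column.
final class UploadIndex
{
    /** @var array<string, string> hex digest => name of the first upload */
    private array $seen = [];

    // Returns the name of an earlier upload with identical content,
    // or null if this content has not been seen before.
    public function register(string $path, string $name): ?string
    {
        $raw = md5_file($path, true);   // 16 raw bytes, what BINARY(16) would store
        if ($raw === false) {
            throw new RuntimeException("Cannot read $path");
        }
        $key = bin2hex($raw);           // hex form used only as the array key
        if (isset($this->seen[$key])) {
            return $this->seen[$key];
        }
        $this->seen[$key] = $name;
        return null;
    }
}

// Usage: the same bytes uploaded under two different names.
$first = tempnam(sys_get_temp_dir(), 'up');
$second = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($first, 'identical content');
file_put_contents($second, 'identical content');

$index = new UploadIndex();
var_dump($index->register($first, 'report.pdf'));        // no earlier copy
var_dump($index->register($second, 'report-copy.pdf'));  // name of the first upload

unlink($first);
unlink($second);
```

Because the digest is computed from the file contents only, renaming a file does not change the lookup key. Swapping in sha1_file() or hash_file() with another algorithm would only change the digest length that the column has to hold.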
