[PHP] Which hashing algorithm is best to check file duplicity?
I want to store the file's hash to the database, so I can check next time to see if that file was already uploaded (even if it was renamed). What would be the best (= fastest + small chance of collision) algorithm in this case? Is crc32 a good choice?

Thank you in advance,
Martin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Which hashing algorithm is best to check file duplicity?
2009/3/15 Martin Zvarík mzva...@gmail.com:
I want to store the file's hash to the database, so I can check next time to see if that file was already uploaded (even if it was renamed). What would be the best (= fastest + small chance of collision) algorithm in this case? Is crc32 a good choice?

I guess not. Maybe store unhex(md5()) in a binary(16) field? What I'm trying to say is that crc32 is more likely to produce collisions than a stronger algorithm such as sha1 or md5, and that the datatype of the database column should be considered too.

Bye
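To illustrate the binary(16) suggestion, here is a minimal sketch of getting the MD5 digest as 16 raw bytes directly in PHP, as an alternative to calling unhex() on the database side. The temporary file is only a stand-in for an uploaded file, and how the value is bound to the column is left out because it depends on the database layer in use.

```php
<?php
// Sketch: compute a file's MD5 as 16 raw bytes, sized for a BINARY(16) column.
// The temp file below is an assumption standing in for a real upload.
$path = tempnam(sys_get_temp_dir(), 'dup');
file_put_contents($path, "example upload contents\n");

$hex = md5_file($path);        // 32 hexadecimal characters
$raw = md5_file($path, true);  // the same digest as 16 raw bytes

echo strlen($hex), " hex chars, ", strlen($raw), " raw bytes\n";

// bin2hex() converts the raw form back to the hex form for display.
echo bin2hex($raw) === $hex ? "round-trip ok\n" : "mismatch\n";

unlink($path);
```

Storing the raw form halves the column size compared with a 32-character hex string, at the cost of needing bin2hex() (or the database's own hex function) whenever the value is shown to a person.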
Re: [PHP] Which hashing algorithm is best to check file duplicity?
On Sun, Mar 15, 2009 at 10:25:11PM +0100, Martin Zvarík wrote:
I want to store the file's hash to the database, so I can check next time to see if that file was already uploaded (even if it was renamed). What would be the best (= fastest + small chance of collision) algorithm in this case? Is crc32 a good choice? Thank you in advance, Martin

According to Wikipedia, a CRC is not sufficient to detect intentional alteration of a file or message, since it is relatively easy to construct a message or file with the same CRC. On the other hand, a CRC will, by design, reliably detect subtle changes to a message or file.

Paul

--
Paul M. Foster
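Apart from deliberate alteration, a 32-bit checksum also runs into accidental collisions once many files are stored. The sketch below uses the standard birthday-bound approximation to estimate that chance for a 32-bit value (crc32) and a 128-bit value (md5). The function name and the figure of 100,000 files are illustrative assumptions, and the formula assumes hash values are uniformly distributed, which is an idealization.

```php
<?php
// Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
// Hypothetical helper; assumes uniformly distributed hash values.
function collisionProbability(int $files, int $bits): float
{
    $pairs = $files * ($files - 1) / 2;
    // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -expm1(-$pairs / pow(2, $bits));
}

$p32  = collisionProbability(100000, 32);   // crc32-sized checksum
$p128 = collisionProbability(100000, 128);  // md5-sized digest

printf("32-bit,  100000 files: %.3f\n", $p32);
printf("128-bit, 100000 files: %.3e\n", $p128);
```

Under these assumptions the 32-bit case comes out at about 0.69, meaning at least one accidental collision is more likely than not, while the 128-bit case is vanishingly small.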
Re: [PHP] Which hashing algorithm is best to check file duplicity?
Martin Zvarík wrote:
I want to store the file's hash to the database, so I can check next time to see if that file was already uploaded (even if it was renamed). What would be the best (= fastest + small chance of collision) algorithm in this case?

Fastest depends mostly on the size of the file, not the algorithm used. A 2 GB file will take a while with md5, just as it will with sha1. md5 is slightly quicker than sha1 and also generates a shorter hash (128 bits versus 160), so the trade-off is up to you.

$ ls -lh file.gz
724M 2008-07-28 10:02 file.gz

$ time sha1sum file.gz
4ae7bd1e79088a3e3849e17c7be989d4a7c97450  file.gz
real 0m3.398s
user 0m3.056s
sys  0m0.336s

$ time md5sum file.gz
16cff7b95bcb5971daf1cabee6ca4edd  file.gz
real 0m2.091s
user 0m1.744s
sys  0m0.328s

$ time sha1sum file.gz
4ae7bd1e79088a3e3849e17c7be989d4a7c97450  file.gz
real 0m3.332s
user 0m2.988s
sys  0m0.344s

$ time md5sum file.gz
16cff7b95bcb5971daf1cabee6ca4edd  file.gz
real 0m2.136s
user 0m1.776s
sys  0m0.348s

--
Postgresql php tutorials
http://www.designmagick.com/
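The same comparison can be run from PHP itself with hash_file(). This is only a rough sketch: the 16 MiB temporary file of repeated bytes is an assumption standing in for a real upload, and a single run per algorithm is not a careful benchmark, so the numbers will vary with the machine and with disk caching.

```php
<?php
// Sketch: time md5 vs sha1 over the same file using hash_file().
// A 16 MiB temp file of repeated bytes stands in for a real upload.
$path = tempnam(sys_get_temp_dir(), 'bench');
file_put_contents($path, str_repeat('x', 16 * 1024 * 1024));

$timings = [];
$digests = [];
foreach (['md5', 'sha1'] as $algo) {
    $start = microtime(true);
    $digests[$algo] = hash_file($algo, $path);   // hex digest of the file
    $timings[$algo] = microtime(true) - $start;  // elapsed wall-clock seconds
    printf("%-4s %s  %.4fs\n", $algo, $digests[$algo], $timings[$algo]);
}

unlink($path);
```

hash_file() reads the file in chunks, so memory use stays flat regardless of file size; only the str_repeat() call that builds the test data here holds the whole 16 MiB in memory.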
Re: [PHP] Which hashing algorithm is best to check file duplicity?
Fastest depends mostly on the size of the file, not the algorithm used. A 2gig file will take a while using md5 as it will using sha1. Using md5 will be slightly quicker than sha1 because generates a shorter hash so the trade-off is up to you.

Aha, thanks for sharing the benchmark. I'll go with MD5().
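For reference, the duplicate check described at the start of the thread could look roughly like the sketch below. The function name isDuplicateUpload is hypothetical, and the $known array is an assumption standing in for the database table of stored hashes; a real version would query and insert into that table instead. As noted earlier in the thread for CRC, this guards against accidental duplicates, not against files crafted to collide.

```php
<?php
// Hypothetical helper: returns true if a file with identical contents was
// already registered. $known stands in for a database table keyed by digest.
function isDuplicateUpload(string $path, array &$known): bool
{
    $digest = md5_file($path);
    if ($digest === false) {
        throw new RuntimeException("Cannot read $path");
    }
    if (isset($known[$digest])) {
        return true;             // same bytes seen before, whatever the name
    }
    $known[$digest] = $path;     // remember this digest for later uploads
    return false;
}

$known = [];
$a = tempnam(sys_get_temp_dir(), 'up');
$b = tempnam(sys_get_temp_dir(), 'up');  // a "renamed" copy with the same bytes
file_put_contents($a, "same contents");
file_put_contents($b, "same contents");

$first  = isDuplicateUpload($a, $known); // false: first time these bytes appear
$second = isDuplicateUpload($b, $known); // true: same bytes under another name
var_dump($first, $second);

unlink($a);
unlink($b);
```

Because the digest depends only on the file's bytes, renaming a file does not change the result, which is the property the original question asks for.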