This isn't a question, I'm just offering it as a cautionary tale and an opportunity to laugh at my own stupidity.

I have a small function to calculate the MD5 checksum for a file. It's nothing fancy:

###################################
import hashlib
def md5(filename, bufsize=65536):
    """
    Compute md5 hash of the named file
    bufsize is 64K by default
    """
    m = hashlib.md5()
    with open(filename,"rb") as fd:
        content = fd.read(bufsize)
        while content:  # an empty read means end of file
            m.update(content)
            content = fd.read(bufsize)
    return m.hexdigest()
###################################

I've discovered a need to calculate the checksum on just the first 10K or so bytes of a file (it's faster when processing a whole CD-ROM or DVD-ROM full of large files, and it also lets me spot when one file is a truncated copy of another).

This seemed like an easy enough variation, and I came up with something like this:

###################################
def md5_partial(filename, bufsize=65536, numbytes=10240):
    """
    Compute md5 hash of the first numbytes (10K by default) of named file
    bufsize is 64K by default
    """
    m = hashlib.md5()
    with open(filename,"rb") as fd:
        bytes_left = numbytes
        bytes_to_read = min(bytes_left, bufsize)
        content = fd.read(bytes_to_read)
        while content:
            # hash each chunk as it arrives, including the last one
            m.update(content)
            bytes_left = bytes_left - len(content)
            bytes_to_read = min(bytes_left, bufsize)
            # once bytes_left hits 0 this reads nothing and the loop ends
            content = fd.read(bytes_to_read)
    return m.hexdigest()
###################################

Okay, not elegant, and violates DRY a little bit, but what the heck.
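
For what it's worth, the duplication can be folded away by letting one function handle both cases. What follows is only a rough sketch under my own assumptions: the name md5_prefix and the convention that numbytes=None means "hash the whole file" are made up for illustration and are not part of the functions above.

```python
import hashlib

def md5_prefix(filename, numbytes=None, bufsize=65536):
    """
    Compute md5 hash of the first numbytes of the named file,
    or of the whole file when numbytes is None.
    (md5_prefix is a hypothetical name used only for this sketch.)
    """
    m = hashlib.md5()
    bytes_left = numbytes
    with open(filename, "rb") as fd:
        while bytes_left is None or bytes_left > 0:
            # never ask for more than one buffer, or more than is left
            want = bufsize if bytes_left is None else min(bytes_left, bufsize)
            content = fd.read(want)
            if not content:
                break  # end of file reached before the limit
            m.update(content)
            if bytes_left is not None:
                bytes_left -= len(content)
    return m.hexdigest()
```

With this shape, md5_prefix(name) would stand in for md5 and md5_prefix(name, numbytes=10240) for md5_partial, and there is only one read call to get wrong.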

I set up a small file (a few hundred bytes) and confirmed that md5 and md5_partial both returned the same value (where the number of bytes I was sampling exceeded the size of the file). Great, working as desired.

But then when I tried a larger file, I was still getting the same checksum for both. It was clearly processing the entire file.

I started messing with it: putting in counters and print statements, using the Gettysburg Address as sample data and iterating over 20 bytes at a time, printing out each one, making sure it stopped appropriately. Still no luck.

I spent 90 minutes over two sessions before I finally found my error.

My invocation of the first checksum was:

###################################
checksumvalue = my.hashing.md5("filename.txt")
# (Not an error: I keep my own modules in Lib/site-packages/my/ )
print checksumvalue
#
# [several lines of code that among other things, define my new
# function being tested]
#
checksumvalue2 = md5_partial("filename.txt", numbytes=200)
print checksumvalue
###################################

Turns out my function was working correctly all along; but with my typo, I was printing out the value from the first checksum each time. Doh!
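
A check that compares against a hash computed independently, instead of eyeballing two printed values, makes this kind of mix-up much harder to miss. Here is a minimal sketch of the idea. Note the assumptions: prefix_md5 is a stand-in written just for this example (it reads the prefix in one go, which is fine for small limits), and the throwaway file and the 200-byte sample size are arbitrary.

```python
import hashlib
import os
import tempfile

def prefix_md5(filename, numbytes):
    """Hash the first numbytes of a file (hypothetical stand-in for md5_partial)."""
    with open(filename, "rb") as fd:
        return hashlib.md5(fd.read(numbytes)).hexdigest()

# Build a throwaway file that is larger than the sample size.
data = b"Four score and seven years ago " * 100
handle, path = tempfile.mkstemp()
with os.fdopen(handle, "wb") as out:
    out.write(data)

try:
    full = hashlib.md5(data).hexdigest()
    partial = prefix_md5(path, 200)
    # Compare against values computed straight from the bytes in memory.
    assert partial == hashlib.md5(data[:200]).hexdigest()
    # On a file bigger than the sample, the two hashes must differ.
    assert partial != full
finally:
    os.remove(path)
```

If the wrong variable ends up in one of those asserts, the script stops with an AssertionError rather than quietly printing a plausible-looking checksum.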

Well, no harm done, other than wasted time, and I did turn up a silly but harmless off-by-one error in the process.