[Tutor] Stupid bug

Terry Carroll Wed, 10 Nov 2010 16:42:52 -0800

This isn't a question, I'm just offering it as a cautionary tale and an opportunity to laugh at my own stupidity.

I have a small function to calculate the MD5 checksum for a file. It's nothing fancy:


###################################
import hashlib
def md5(filename, bufsize=65536):
    """
    Compute md5 hash of the named file
    bufsize is 64K by default
    """
    m = hashlib.md5()
    with open(filename,"rb") as fd:
        content = fd.read(bufsize)
        while content:  # read() returns an empty string at end of file
            m.update(content)
            content = fd.read(bufsize)
    return m.hexdigest()
###################################

I've discovered a need to calculate the checksum on the first 10K or so bytes of the file (faster when processing a whole CDROM or DVDROM full of large files; and also allows me to find when one file is a truncated copy of another).

This seemed like an easy enough variation, and I came up with something like this:


###################################
def md5_partial(filename, bufsize=65536, numbytes=10240):
    """
    Compute md5 hash of the first numbytes (10K by default) of named file
    bufsize is 64K by default
    """
    m = hashlib.md5()
    with open(filename,"rb") as fd:
        bytes_left = numbytes
        bytes_to_read = min(bytes_left, bufsize)
        content = fd.read(bytes_to_read)
        bytes_left = bytes_left - bytes_to_read
        # a zero-byte read returns an empty string, so the loop also
        # ends once numbytes bytes have been consumed
        while content:
            m.update(content)
            bytes_to_read=min(bytes_left, bufsize)
            content = fd.read(bytes_to_read)
            bytes_left = bytes_left - bytes_to_read
    return m.hexdigest()
###################################

Okay, not elegant, and violates DRY a little bit, but what the heck.
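
For comparison, the duplicated read-and-decrement lines can be folded into a single loop. The following is only a sketch of one way to do that, not the version actually in use: the name md5_partial_dry is made up for illustration, and it counts the bytes read() actually returned rather than the bytes requested. It tests the chunk for emptiness instead of comparing against "", so it should behave the same on Python 2 and Python 3.

```python
import hashlib

def md5_partial_dry(filename, bufsize=65536, numbytes=10240):
    """
    Compute md5 hash of the first numbytes (10K by default) of named file
    bufsize is 64K by default
    (md5_partial_dry is a hypothetical name used for this sketch)
    """
    m = hashlib.md5()
    bytes_left = numbytes
    with open(filename, "rb") as fd:
        while bytes_left > 0:
            # a single read call, sized to whichever limit is smaller
            content = fd.read(min(bytes_left, bufsize))
            if not content:
                # end of file reached before numbytes were read
                break
            m.update(content)
            # count what read() returned, not what was requested
            bytes_left -= len(content)
    return m.hexdigest()
```

With a file shorter than numbytes, the loop stops at end of file and the result matches the checksum of the whole file.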

I set up a small file (a few hundred bytes) and confirmed that md5 and md5_partial both returned the same value (where the number of bytes I was sampling exceeded the size of the file). Great, working as desired.

But then when I tried a larger file, I was still getting the same checksum for both. It was clearly processing the entire file.

I started messing with it; putting in counters and print statements, using the Gettysburg Address as sample data and iterating over 20 bytes at a time, printing out each one, making sure it stopped appropriately. Still no luck.


I spent 90 minutes over two sessions before I finally found my error.

My invocation of the first checksum was:

###################################
checksumvalue = my.hashing.md5("filename.txt")
# (Not an error: I keep my own modules in Lib/site-packages/my/ )
print checksumvalue
#
# [several lines of code that among other things, define my new
# function being tested]
#
checksumvalue2 = md5_partial("filename.txt", numbytes=200)
print checksumvalue
###################################

Turns out my function was working correctly all along; but with my typo, I was printing out the value from the first checksum each time. Doh!

Well, no harm done, other than wasted time, and I did turn up a silly but harmless off-by-one error in the process.
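
A check along the following lines avoids eyeballing printed values altogether. It compares the partial checksum against one computed independently from the same bytes, so a mistyped variable name in a print statement cannot hide the result. This is only a sketch: reference_md5 and check_partial are hypothetical helper names, and the scratch file stands in for filename.txt.

```python
import hashlib
import os
import tempfile

def reference_md5(filename, numbytes):
    """md5 of the first numbytes bytes, read in a single call"""
    with open(filename, "rb") as fd:
        return hashlib.md5(fd.read(numbytes)).hexdigest()

def check_partial(func, numbytes=200, filesize=1000):
    """
    Return True if func(filename, numbytes=numbytes) matches the
    reference value on a scratch file of filesize bytes
    """
    # non-repeating-looking data, so a wrong byte range changes the hash
    data = bytes(bytearray(i % 251 for i in range(filesize)))
    handle, path = tempfile.mkstemp()
    try:
        with os.fdopen(handle, "wb") as fd:
            fd.write(data)
        return func(path, numbytes=numbytes) == reference_md5(path, numbytes)
    finally:
        os.remove(path)  # clean up the scratch file either way
```

Passing the function under test, as in check_partial(md5_partial), gives a plain True or False, with the scratch file deliberately larger than numbytes so that a function hashing the entire file fails the check.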

_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
