Re: checksum problem

2018-01-30 Thread Chris Angelico
On Wed, Jan 31, 2018 at 6:21 AM, Peter Pearson wrote:
> On Tue, 30 Jan 2018 11:24:07 +0100, jak  wrote:
>>     with open(fname, "rb") as fh:
>>         for data in fh.read(m.block_size * blocks):
>>             m.update(data)
>>     return m.hexdigest()
>>
>
> I believe your "for data in fh.read" loop just reads the first block of
> the file and loops over the bytes in that block (calling m.update once
> for each byte, probably the least efficient approach imaginable),
> omitting the remainder of the file.  That's why you start getting the
> right answer when the first block is big enough to encompass the whole
> file.

Correct analysis.

Generally, if you want to read a file in chunks, the easiest way is this:

while "moar data":
    data = fh.read(block_size)
    if not data: break
    m.update(data)

That should get you the correct result regardless of your block size,
and then you can tweak the block size to toy with performance.
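
For reference, the loop above can be wrapped into a complete function. This is only a minimal sketch: the name checksum, the chunk_size parameter, and its 64 KiB default are illustrative assumptions, not something from the original post.

```python
import hashlib


def checksum(fname, chunk_size=64 * 1024):
    """Return the SHA-1 hex digest of a file, reading it in chunks.

    chunk_size is an arbitrary illustrative default; any positive
    value should give the same digest.
    """
    m = hashlib.sha1()
    with open(fname, "rb") as fh:
        while True:
            data = fh.read(chunk_size)  # returns b"" at end of file
            if not data:
                break
            m.update(data)  # feed the whole chunk, not single bytes
    return m.hexdigest()
```

Because read() is called again on every pass, the whole file is consumed no matter how small the chunk is, so the digest should not depend on chunk_size.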

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: checksum problem

2018-01-30 Thread Peter Pearson
On Tue, 30 Jan 2018 11:24:07 +0100, jak  wrote:
> Hello everybody,
> I'm using python 2.7.14 and calculating the checksum with the sha1 
> algorithm and this happens: the checksum is wrong until I read the whole 
> file in one shot. Here is a test program:
>
> import hashlib
>
> def Checksum(fname, blocks):
>     m = hashlib.sha1()
>     print "sha1 block size: " + str(m.block_size * blocks)
>     with open(fname, "rb") as fh:
>         for data in fh.read(m.block_size * blocks):
>             m.update(data)
>     return m.hexdigest()
>
> def main():
>     for b in range(10, 260, 10):
>         print str(b) + ': ' + Checksum("d:/upload_688df390ea0bd728fdbeb8972ae5f7be.zip", b)
>
> if __name__ == '__main__':
>     main()
>
>
> and this is the result output:
>
> sha1 block size: 640
> 10: bf09de3479b2861695fb8b7cb18133729ef00205
> sha1 block size: 1280
> 20: 71a5499e4034fdcf0eb0c5d960c8765a8b1f032d
> .
> .
> .
> sha1 block size: 12160
> 190: 956d017b7ed734a7b4bfdb02519662830dab4fbe
> sha1 block size: 12800
> 200: 1b2febe05b70f58350cbb87df67024ace43b76e5
> sha1 block size: 13440
> 210: 93832713edb40cf4216bbfec3c659842fbec6ae4
> sha1 block size: 14080
> 220: 93832713edb40cf4216bbfec3c659842fbec6ae4
> .
> .
> .
>
> the file size is 13038 bytes and its checksum is 
> 93832713edb40cf4216bbfec3c659842fbec6ae4
>
> Why do I get these results? What am I doing wrong?
>
> Thanks to everyone in advance.

I believe your "for data in fh.read" loop just reads the first block of
the file and loops over the bytes in that block (calling m.update once
for each byte, probably the least efficient approach imaginable),
omitting the remainder of the file.  That's why you start getting the
right answer when the first block is big enough to encompass the whole
file.
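
The behaviour described above can be reproduced without the original zip file. The sketch below makes several assumptions: it uses an in-memory stand-in of 13038 bytes (the file size reported in the question, but not the file's actual content), the helper name buggy_checksum is hypothetical, and the per-byte loop is emulated with one-byte slices so that it also runs on Python 3, where iterating over bytes yields integers rather than one-character strings.

```python
import hashlib
import io


def buggy_checksum(fh, n):
    """Emulate the original loop: one read(n), then one update per byte."""
    m = hashlib.sha1()
    block = fh.read(n)  # the only read() call; the rest of the file is ignored
    for i in range(len(block)):
        m.update(block[i:i + 1])  # one-byte slice, like iterating a Python 2 str
    return m.hexdigest()


# Stand-in data with the same length as the file in the question.
payload = b"x" * 13038

# 200 * 64 = 12800 bytes requested: only a prefix of the data gets hashed.
short = buggy_checksum(io.BytesIO(payload), 12800)
print(short == hashlib.sha1(payload[:12800]).hexdigest())

# 210 * 64 = 13440 bytes requested: the single read() covers everything.
full = buggy_checksum(io.BytesIO(payload), 13440)
print(full == hashlib.sha1(payload).hexdigest())
```

This matches the output in the question, where the digest first becomes correct at 210 blocks (13440 bytes), the first size larger than the 13038-byte file.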

-- 
To email me, substitute nowhere->runbox, invalid->com.