[issue45150] Add a file_digest() function in hashlib

2022-03-22 Thread Christian Heimes


Christian Heimes  added the comment:


New changeset e03db6d5be7cf2e6b7b55284985c404de98a9420 by Christian Heimes in 
branch 'main':
bpo-45150: Fix testing under FIPS mode (GH-32046)
https://github.com/python/cpython/commit/e03db6d5be7cf2e6b7b55284985c404de98a9420


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45150] Add a file_digest() function in hashlib

2022-03-22 Thread Christian Heimes


Change by Christian Heimes :


--
pull_requests: +30136
pull_request: https://github.com/python/cpython/pull/32046




[issue45150] Add a file_digest() function in hashlib

2022-03-22 Thread miss-islington


miss-islington  added the comment:


New changeset 4f97d64c831c94660ceb01f34d51fa236ad968b0 by Christian Heimes in 
branch 'main':
bpo-45150: Add hashlib.file_digest() for efficient file hashing (GH-31930)
https://github.com/python/cpython/commit/4f97d64c831c94660ceb01f34d51fa236ad968b0


--
nosy: +miss-islington




[issue45150] Add a file_digest() function in hashlib

2022-03-16 Thread Aur Saraf


Aur Saraf  added the comment:

I don't think HMAC of a file is a common enough use case to support, but I have 
absolutely no problem conceding this point; the cost of supporting it is very 
low.

I/O in C is a world of pain in general. In the specific case of `io.RawIOBase` 
objects (non-buffered binary files) to my understanding it's not _that_ 
terrible (am I right? Does my I/O code work as-is?). To my understanding, 
providing a fast path *just for this case* that calculates the hash without 
taking the GIL for every chunk would be very nice to have for many use cases.

Now, we could just be happy with `file_digest()` having an `if` for 
`isinstance(fileobj, io.RawIOBase)` that chooses a fast code path silently. But 
since non-buffered binary files are so hard to tell apart from other types of 
file-like objects, as a user of this code I would like to have a way to say "I 
want the fast path, please raise if I accidentally passed the wrong thing and 
got the regular path". We could have `file_digest('sha256', open(path, 'rb', 
buffering=0), ensure_fast_io=True)`, but I think for this use case 
`raw_file_digest('sha256', open(path, 'rb', buffering=0))` is cleaner.

In all other cases you just call `file_digest()`, probably get the Python I/O 
and not the C I/O, and are still happy to have that loop written for you by 
someone who knows what they're doing.

For the same reason I think the fast path should only support hash names and 
not constructors, functions, etc., which would complicate it because 
new-object-can-be-accessed-without-GIL wouldn't necessarily apply.
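
To make the proposed strictness concrete, here is a minimal Python-level sketch. It only illustrates the "raise if the file is not raw" contract and the "hash names only" restriction; it does not implement the C fast path discussed above, and the signature of `raw_file_digest()` and its `chunk_size` parameter are assumptions made for illustration.

```python
import hashlib
import io


def raw_file_digest(name, fileobj, chunk_size=2**18):
    """Hypothetical strict variant: accept only unbuffered binary files."""
    # open(path, "rb", buffering=0) returns io.FileIO, an io.RawIOBase
    # subclass; the default buffered open() returns io.BufferedReader,
    # which is not.
    if not isinstance(fileobj, io.RawIOBase):
        raise TypeError(
            f"expected an unbuffered binary file, got {type(fileobj).__name__}"
        )
    # Only hash names are accepted on this path, as suggested above.
    if not isinstance(name, str):
        raise TypeError("digest must be given as a hash name")
    digest = hashlib.new(name)
    view = memoryview(bytearray(chunk_size))
    while True:
        size = fileobj.readinto(view)
        if size is None:
            # Non-blocking raw file with no data ready.
            raise BlockingIOError("I/O operation would block")
        if size == 0:  # end of file
            break
        digest.update(view[:size])
    return digest
```

With this shape, passing a file opened without `buffering=0` fails loudly instead of silently taking a slower path.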

Does this make sense?

--




[issue45150] Add a file_digest() function in hashlib

2022-03-16 Thread Christian Heimes


Change by Christian Heimes :


--
pull_requests: +30021
pull_request: https://github.com/python/cpython/pull/31930




[issue45150] Add a file_digest() function in hashlib

2022-03-16 Thread Christian Heimes


Christian Heimes  added the comment:

Before we continue hacking on an implementation, let's discuss some API design.

- Multiple functions or multiple dispatch (name, fileobj) are confusing to 
users. Let's keep it simple and only implement one function that operates on 
file objects.

- The function should work with most binary file-like objects, including 
open(..., "rb"), BytesIO, and SocketIO. mmap.mmap() is not file-like enough: an 
anonymous mapping doesn't provide a fileno, and the mmap object has no 
readinto().

- The function should accept either digest name, digest constructor, or a 
callable that returns a digest object. The latter makes it possible to reuse 
file_digest() with MAC constructs like HMAC.

- Don't do any I/O in C unless you are prepared to enter a world of pain and 
suffering. It's hard to get it right across platforms. For example your C code 
does not work for SocketIO on Windows.

- If we decide to implement an accelerator in C, then we don't have to bother 
with our fallback copies like _sha256module.c. They are slow and only used when 
OpenSSL is not available.
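
The design points above can be sketched in pure Python as follows. This is an illustration under assumptions, not the code from any of the linked pull requests: the name `file_digest_sketch`, the `chunk_size` parameter, and its default value are hypothetical.

```python
import hashlib
import hmac
import io


def file_digest_sketch(fileobj, digest, chunk_size=2**18):
    """Hash a binary file-like object and return the digest object."""
    # Accept a hash name, a hash constructor, or any callable that
    # returns an object with update() (for example an HMAC factory).
    digestobj = hashlib.new(digest) if isinstance(digest, str) else digest()
    # Objects without readinto() are rejected in this sketch.
    if not hasattr(fileobj, "readinto"):
        raise TypeError(f"{type(fileobj).__name__} has no readinto() method")
    view = memoryview(bytearray(chunk_size))
    while True:
        size = fileobj.readinto(view)
        if size is None:
            raise BlockingIOError("I/O operation would block")
        if size == 0:  # end of file
            break
        digestobj.update(view[:size])
    return digestobj


# One function, three ways to name the digest:
data = io.BytesIO(b"example payload")
by_name = file_digest_sketch(data, "sha256")
data.seek(0)
by_constructor = file_digest_sketch(data, hashlib.sha256)
data.seek(0)
mac = file_digest_sketch(data, lambda: hmac.new(b"secret key", digestmod="sha256"))
```

All I/O stays in Python here; only the hashing itself runs in C.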

--
assignee: tarek -> christian.heimes




[issue45150] Add a file_digest() function in hashlib

2022-03-15 Thread Aur Saraf


Aur Saraf  added the comment:

Added an attempt to handle signals. I don't think it's working, because when I 
press Ctrl+C while hashing a long file, it only raises KeyboardInterrupt after 
waiting the amount of time it usually takes the C code to return, but maybe 
that's not a good test?

--




[issue45150] Add a file_digest() function in hashlib

2022-03-15 Thread Aur Saraf


Aur Saraf  added the comment:

Forgot an important warning: this is the first time I have written C code 
against the Python API, and I didn't thoroughly read the guide (or at all, to be 
honest). I think I did a good job, but please suspect my code of noob errors.

I'm especially not confident that it's OK to not do any special handling of 
signals. Can read() return 0 if it was interrupted by a signal? This will stop 
the hash calculation midway and behave as if it succeeded. Sounds suspiciously 
like something we don't want. Also, I probably should support signals because 
such a long operation is something the user definitely might want to interrupt?

May I have some guidance please? Would it be enough to copy the code from 
fileutils.c _Py_Read() and add an outer loop so we can do many reads with the 
GIL released and still call PyErr_CheckSignals when needed with the GIL taken?
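
For comparison, the same read loop at the Python level might look like the sketch below (`hash_fd` and `chunk_size` are hypothetical names). On POSIX, a read() interrupted by a signal before any data was transferred fails with EINTR rather than returning 0, so a zero-length result still means end of file. Since Python 3.5 (PEP 475), os.read() runs the Python signal handlers in that situation and retries the call unless a handler raises.

```python
import hashlib
import os


def hash_fd(fd, name="sha256", chunk_size=2**18):
    """Hash everything readable from a file descriptor until EOF."""
    digest = hashlib.new(name)
    while True:
        # os.read() returns b"" only at end of file. If a signal arrives
        # while the call is blocked, the Python-level handler runs; the
        # default SIGINT handler raises KeyboardInterrupt, which
        # propagates out of this loop. Otherwise the read is retried.
        chunk = os.read(fd, chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest
```

Because control returns to the interpreter between chunks, an interrupt is noticed after at most one chunk has been read and hashed, not after the whole file.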

--




[issue45150] Add a file_digest() function in hashlib

2022-03-15 Thread Aur Saraf


Aur Saraf  added the comment:

The rationale behind `from_raw_file()` and the special treatment of 
non-buffered IO is that there is no `read_buffer()` API or other clean way to 
say "I want to read just what's currently in the buffer so that from now on I 
could read directly from the file descriptor without harm".

If you want to read from a buffered file object, sure, just call `from_file()`. 
If you want to ensure you'll get full performance benefits, call 
`from_raw_file()`. If you pass an eligible file object to `from_file()` you'll 
get the benefits anyway, because why not.
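
A small experiment shows why the buffered case is awkward. The helper below is a sketch written for this illustration (`positions_after_one_byte` is a hypothetical name): it reads one byte and compares the position reported by the file object with the offset of the underlying file descriptor.

```python
import os


def positions_after_one_byte(path, buffering=-1):
    """Return (file object position, fd offset) after reading one byte."""
    with open(path, "rb", buffering=buffering) as f:
        f.read(1)
        logical = f.tell()
        # Query the descriptor's offset without moving it.
        fd_offset = os.lseek(f.fileno(), 0, os.SEEK_CUR)
    return logical, fd_offset
```

With the default buffering, the buffered reader has already pulled more than one byte from the descriptor, so the fd offset is ahead of `f.tell()` and a C loop reading straight from the descriptor would skip the buffered bytes. With `buffering=0` the two values agree.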

--




[issue45150] Add a file_digest() function in hashlib

2022-03-15 Thread Aur Saraf


Aur Saraf  added the comment:

The PR contains a draft implementation. I would appreciate some review before I 
implement the same interface on all builtin hashes as well as OpenSSL hashes.

--




[issue45150] Add a file_digest() function in hashlib

2022-03-15 Thread Roundup Robot


Change by Roundup Robot :


--
pull_requests: +30019
pull_request: https://github.com/python/cpython/pull/31928




[issue45150] Add a file_digest() function in hashlib

2022-03-14 Thread Aur Saraf


Aur Saraf  added the comment:

OK, I'll give it a go.

--




[issue45150] Add a file_digest() function in hashlib

2022-03-14 Thread Tarek Ziadé

Tarek Ziadé  added the comment:

@Aur, go for it. I started to implement it and got lost in the details for 
each backend.

--




[issue45150] Add a file_digest() function in hashlib

2022-03-14 Thread Aur Saraf


Aur Saraf  added the comment:

Tarek,

Are you still working on this? Would you like me to take over?

Aur

--
nosy: +Aur.Saraf




[issue45150] Add a file_digest() function in hashlib

2021-09-09 Thread Tarek Ziadé

Tarek Ziadé  added the comment:

Hey Christian, I hope things are well for you!
Thanks for all the valuable feedback; I'll rework the patch accordingly.

--




[issue45150] Add a file_digest() function in hashlib

2021-09-09 Thread Christian Heimes


Christian Heimes  added the comment:

Hey Tarek, long time no see!

* the _sha256 module is optional, can be disabled and is not available in some 
distributions.

* I also don't like using sha256 as the default. It's slow, even slower than 
sha512. Any default also makes it harder to upgrade to a better, more secure 
default in the future.

* Like hmac.new(), file_digest() should accept PEP 452-compatible arguments 
and a hash name as the digestmod argument, not just a callable.

* a filename argument prevents users from passing in file-like objects like 
BytesIO.

* 4096 bytes chunk size is very conservative. The call overhead for read() and 
update() may dominate the performance of the function.

* The hex argument feels weird.
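
The chunk-size point can be made concrete by counting round trips. The helper below is a sketch for illustration only (`digest_with_chunks` is a hypothetical name); it takes the hash name explicitly and has no default algorithm.

```python
import hashlib
import io


def digest_with_chunks(fileobj, name, chunk_size):
    """Return (hexdigest, number of read()/update() round trips)."""
    digest = hashlib.new(name)
    round_trips = 0
    while chunk := fileobj.read(chunk_size):
        digest.update(chunk)
        round_trips += 1
    return digest.hexdigest(), round_trips


payload = bytes(2**20)  # 1 MiB of zero bytes
small = digest_with_chunks(io.BytesIO(payload), "sha256", 4096)
large = digest_with_chunks(io.BytesIO(payload), "sha256", 2**18)
```

For this 1 MiB input, the 4096-byte chunks need 256 round trips and the 256 KiB chunks need 4, while both produce the same digest. Each round trip pays the Python call overhead for read() and update().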

In a perfect world, the hash and hmac objects should get an "update_file" 
method. The OpenSSL-based hashes could even release the GIL and utilize 
OpenSSL's BIO layer to avoid any Python overhead.

--




[issue45150] Add a file_digest() function in hashlib

2021-09-09 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
nosy: +christian.heimes, gregory.p.smith
type:  -> enhancement
versions: +Python 3.11




[issue45150] Add a file_digest() function in hashlib

2021-09-09 Thread Roundup Robot


Change by Roundup Robot :


--
keywords: +patch
nosy: +python-dev
nosy_count: 1.0 -> 2.0
pull_requests: +26673
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/28252




[issue45150] Add a file_digest() function in hashlib

2021-09-09 Thread Tarek Ziadé

New submission from Tarek Ziadé :

I am proposing the addition of a very simple helper to return the hash of a 
file.

--
assignee: tarek
components: Library (Lib)
messages: 401457
nosy: tarek
priority: normal
severity: normal
status: open
title: Add a file_digest() function in hashlib
