[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2022-01-20 Thread Jelle Zijlstra


Jelle Zijlstra  added the comment:

This behavior is definitely unfortunate, but by now it's also been baked into 
more than a decade of Python 3 releases, so backward compatibility constraints 
make it difficult to fix.

How can we be sure this change won't break users' code?

For reference, here are a few uses of the function I found in major open-source 
packages:
- https://github.com/httplib2/httplib2/blob/cde9e87d8b2c4c5fc966431965998ed5f45d19c7/python3/httplib2/__init__.py#L1608
  - this assumes it only ever hits the (bytes, encoding) case.
- https://github.com/cherrypy/cherrypy/blob/98929b519fbca003cbf7b14a6b370a3cabc9c412/cherrypy/lib/httputil.py#L258
  - this assumes it only gets (str, None) or (bytes, encoding) pairs, which
    seems unsafe. But if it currently sees (str, None) and would see
    (bytes, None) with this change, it would break.
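
To make the second failure mode concrete, here is a minimal sketch. The
join_parts() function is a hypothetical caller written for illustration; it is
not the code from either project linked above. It decodes only the parts that
carry a charset, so it works with a (str, None) pair and fails with a
(bytes, None) pair.

```python
def join_parts(parts):
    """Hypothetical caller: decodes only the parts that carry a charset."""
    decoded = []
    for value, charset in parts:
        if charset is not None:
            value = value.decode(charset)
        decoded.append(value)
    return ''.join(decoded)


# Shape returned today for an all-raw header: a (str, None) pair works.
print(join_parts([('plain string', None)]))

# Shape an all-bytes change would return: the (bytes, None) part is left
# as bytes, so str.join() raises TypeError.
try:
    join_parts([(b'plain string', None)])
except TypeError as exc:
    print('breaks:', exc)
```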

An alternative solution could be a new function with a sane return type.

Even if we decide to not change anything, we should document the surprising 
return type at https://docs.python.org/3.10/library/email.header.html.

--
nosy: +Jelle Zijlstra

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2022-01-11 Thread Daniel Lenski


Change by Daniel Lenski :


--
keywords: +patch
pull_requests: +28748
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/30548




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2021-12-30 Thread Daniel Lenski


Daniel Lenski  added the comment:

Due to this bug, any user of this function in Python 3.0+ *already* has to be 
able to handle all of the following outputs in order to use it reliably:

   decode_header(...) -> [(str, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]

== Fix str/bytes inconsistency ==

We could eliminate the inconsistency, and make the function only ever return 
bytes instead of str, with the following changes to 
https://github.com/python/cpython/blob/3.10/Lib/email/header.py.

```
diff --git a/Lib/email/header.py.orig b/Lib/email/header.py
index 4ab0032..41e91f2 100644
--- a/Lib/email/header.py
+++ b/Lib/email/header.py
@@ -61,7 +61,7 @@ _max_append = email.quoprimime._max_append
 def decode_header(header):
     """Decode a message header value without converting charset.
 
-    Returns a list of (string, charset) pairs containing each of the decoded
+    Returns a list of (bytes, charset) pairs containing each of the decoded
     parts of the header.  Charset is None for non-encoded parts of the header,
     otherwise a lower-case string containing the name of the character set
     specified in the encoded string.
@@ -78,7 +78,7 @@ def decode_header(header):
                     for string, charset in header._chunks]
     # If no encoding, just return the header with no charset.
     if not ecre.search(header):
-        return [(header, None)]
+        return [(header.encode(), None)]
     # First step is to parse all the encoded parts into triplets of the form
     # (encoded_string, encoding, charset).  For unencoded strings, the last
     # two parts will be None.
```

With these changes, decode_header() would return one of the following:

   decode_header(...) -> [(bytes, None)]
or decode_header(...) -> [(bytes, str)]
or decode_header(...) -> [(bytes, (str|None)), (bytes, (str|None)), ...]


== Ensure that charset is always str, never None ==

A couple more small changes:

```
@@ -92,7 +92,7 @@ def decode_header(header):
                 unencoded = unencoded.lstrip()
                 first = False
             if unencoded:
-                words.append((unencoded, None, None))
+                words.append((unencoded, None, 'ascii'))
             if parts:
                 charset = parts.pop(0).lower()
                 encoding = parts.pop(0).lower()
@@ -133,7 +133,8 @@ def decode_header(header):
     # Now convert all words to bytes and collapse consecutive runs of
     # similarly encoded words.
     collapsed = []
-    last_word = last_charset = None
+    last_word = None
+    last_charset = 'ascii'
     for word, charset in decoded_words:
         if isinstance(word, str):
             word = bytes(word, 'raw-unicode-escape')
```

With these changes, decode_header() would return only:

   decode_header(...) -> List[(bytes, str)]
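
For comparison, the same List[(bytes, str)] shape can be approximated today
without patching the standard library. The sketch below is only an
illustration: decode_header_with_charset() is a hypothetical name, and unlike
the patched function it does not merge adjacent parts that end up with the same
charset. It reuses the 'raw-unicode-escape' codec that header.py already
applies to unencoded words.

```python
import email.header
from typing import List, Tuple


def decode_header_with_charset(header: str) -> List[Tuple[bytes, str]]:
    """Hypothetical wrapper: every part is bytes, every charset is str."""
    result = []
    for value, charset in email.header.decode_header(header):
        if isinstance(value, str):
            # All-raw headers come back as str; convert them the same way
            # header.py converts unencoded words in mixed headers.
            value = value.encode('raw-unicode-escape')
        # Assumption: treat a missing charset as 'ascii', as in the diff.
        result.append((value, charset or 'ascii'))
    return result


print(decode_header_with_charset('plain string'))
print(decode_header_with_charset('abc=?utf-8?B?ZsOzbw==?='))
```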

--




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2021-12-30 Thread Daniel Lenski

Daniel Lenski  added the comment:

I recently ran into this bug as well.

For those looking for a reliable workaround, here's an implementation of a 
'decode_header_to_string' function which should Just Work™ in all possible 
cases:

#!/usr/bin/python3
import email.header

# Workaround for https://bugs.python.org/issue22833
def decode_header_to_string(header):
    '''Decodes an email message header (possibly RFC2047-encoded)
    into a string, while working around https://bugs.python.org/issue22833'''

    return ''.join(
        alleged_string if isinstance(alleged_string, str)
        else alleged_string.decode(alleged_charset or 'ascii')
        for alleged_string, alleged_charset
        in email.header.decode_header(header))


for header in ('=?utf-8?B?ZsOzbw==',
               '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?=',
               'plain string',
               'foo=?blah?Q??=',):
    print("Header value: %r" % header)
    print("email.header.decode_header(...) -> %r"
          % email.header.decode_header(header))
    print("decode_header_to_string(...)-> %r"
          % decode_header_to_string(header))
    print("---")

Outputs:

Header value: '=?utf-8?B?ZsOzbw=='
email.header.decode_header(...) -> [('=?utf-8?B?ZsOzbw==', None)]
decode_header_to_string(...)-> '=?utf-8?B?ZsOzbw=='
---
Header value: '=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
email.header.decode_header(...) -> [(b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
decode_header_to_string(...)-> 'hellofóo'
---
Header value: 'bar=?ascii?Q?hello?==?utf-8?B?ZsOzbw==?='
email.header.decode_header(...) -> [(b'bar', None), (b'hello', 'ascii'), (b'f\xc3\xb3o', 'utf-8')]
decode_header_to_string(...)-> 'barhellofóo'
---
Header value: 'plain string'
email.header.decode_header(...) -> [('plain string', None)]
decode_header_to_string(...)-> 'plain string'
---
Header value: 'foo=?blah?Q??='
email.header.decode_header(...) -> [(b'foo', None), (b'', 'blah')]
decode_header_to_string(...)-> 'foo'
---

--
nosy: +dlenski




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2021-12-12 Thread Irit Katriel


Irit Katriel  added the comment:

Reproduced on 3.11.

--
nosy: +iritkatriel
versions: +Python 3.10, Python 3.11, Python 3.9 -Python 3.4, Python 3.5




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2014-11-10 Thread R. David Murray

R. David Murray added the comment:

This is a duplicate of issue 6302.  Re-reading that issue (again), I'm not 
quite sure why we didn't fix it, but it may be too late to fix it now for 
backward compatibility reasons.

Since that issue strayed off into other topics, I'm going to leave this one 
open to consider whether or not we can/should fix this.  The new email API does 
avoid this problem, though.  Is there a reason you are choosing not to use the 
new API?

--
versions:  -Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22833
___



[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2014-11-10 Thread py.user

py.user added the comment:

R. David Murray wrote:
> Is there a reason you are choosing not to use the new API?

My program is for Python 3.x. I need to decode wild headers to pretty unicode 
strings. Currently I do it with decode_header() and a try...except for 
AttributeError, since a unicode string has no .decode() method.
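
A minimal sketch of that try...except approach, for readers who have not seen
it. The header_to_text() name and the fallback to 'ascii' when the charset is
None are assumptions made for illustration; they are not taken from the
program described above.

```python
import email.header


def header_to_text(header):
    """Hypothetical helper: decode a header to str via try...except."""
    pieces = []
    for value, charset in email.header.decode_header(header):
        try:
            # bytes parts have .decode(); assume ASCII when charset is None.
            pieces.append(value.decode(charset or 'ascii'))
        except AttributeError:
            # str parts have no .decode() method in Python 3.
            pieces.append(value)
    return ''.join(pieces)


print(header_to_text('abc'))
print(header_to_text('abc=?utf-8?B?ZsOzbw==?='))
```

Note that a raw part containing non-ASCII bytes would still raise
UnicodeDecodeError under the 'ascii' assumption.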

I don't know what the new API is, but I guess it's not compatible with Python 3.0.

--




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2014-11-10 Thread R. David Murray

R. David Murray added the comment:

Certainly not with 3.0, but nobody in their right mind should be using that 
version any more :).

The new API for decoding headers is available as of Python 3.3, with additional 
new API features in 3.4.  See 

https://docs.python.org/3/library/email-examples.html#examples-using-the-provisional-api

for an example.  Note that although the API is 'provisional', I anticipate no 
non-trivial changes when it becomes final in 3.5.  (The only API change that 
has happened was made such that you get warnings if you use it wrong in 
3.4, and it is in a relatively obscure method, is_attachment.)
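
As a rough illustration of what the new API looks like in use (a sketch, not a
copy of the linked example): when a message is parsed with
email.policy.default, header values come back as already-decoded str
subclasses, so the bytes/str question does not arise. The sample message below
is made up for the demonstration.

```python
from email import message_from_string, policy

# Made-up message with an RFC 2047 encoded Subject header.
raw = 'Subject: =?utf-8?B?ZsOzbw==?=\n\nbody\n'

msg = message_from_string(raw, policy=policy.default)
subject = msg['Subject']

# With policy.default the header value is a str subclass holding the
# decoded text, so no decode_header() call is needed.
print(isinstance(subject, str))
print(subject)
```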

--




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2014-11-09 Thread py.user

New submission from py.user:

What email.header.decode_header() returns depends on whether the header 
contains an encoded part.
If the header has both a raw part and an encoded part, the function returns 
(bytes, None) for the raw part. But if the header has only a raw part, the 
function returns (str, None) for it.

>>> import email.header
>>>
>>> s = 'abc=?koi8-r?q?\xc1\xc2\xd7?='
>>> email.header.decode_header(s)
[(b'abc', None), (b'\xc1\xc2\xd7', 'koi8-r')]
>>>
>>> s = 'abc'
>>> email.header.decode_header(s)
[('abc', None)]


There should be (bytes, None) for both cases.

--
components: Library (Lib), email
messages: 230932
nosy: barry, py.user, r.david.murray
priority: normal
severity: normal
status: open
title: The decode_header() function decodes raw part to bytes or str, depending 
on encoded part
versions: Python 3.3, Python 3.4, Python 3.5




[issue22833] The decode_header() function decodes raw part to bytes or str, depending on encoded part

2014-11-09 Thread py.user

Changes by py.user bugzilla-mail-...@yandex.ru:


--
type:  -> behavior
