[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-28 Thread John Paul Adrian Glaubitz


John Paul Adrian Glaubitz  added the comment:

> Awesome, thanks! I'll give it a try later today or tomorrow.

I have applied the patch and the problem seems to have been fixed. \o/

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-27 Thread John Paul Adrian Glaubitz


John Paul Adrian Glaubitz  added the comment:

Awesome, thanks! I'll give it a try later today or tomorrow.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-27 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I fixed all suspicious places for which I found reproducers in PR 32137.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-27 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
keywords: +patch
pull_requests: +30217
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/32137

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-27 Thread John Paul Adrian Glaubitz


John Paul Adrian Glaubitz  added the comment:

Hi Serhiy!

> The simple fix is to add UnicodeEncodeError to "except LookupError". But 
> there may be other places where we can get a similar error. They should be 
> fixed too.

I would be very interested to test this as this issue currently blocks my use 
of offlineimap.

Would you mind creating a proof-of-concept patch for me if it's not too much 
work?

Thanks!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-27 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Sorry, I was puzzled by the exception type and missed details in a long 
traceback (I have issues with reading large texts). Thank you for your detailed 
report.

The simple fix is to add UnicodeEncodeError to "except LookupError". But there 
may be other places where we can get a similar error. They should be fixed too.

Alternatively we can do something when we get an invalid charset from the 
parsed data. I am not the email package expert, so I do not know what would be 
better in that context.

--
components: +Library (Lib)
versions: +Python 3.11 -Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-26 Thread Anders Kaseorg

Anders Kaseorg  added the comment:

It could and does, as quoted in my original report.

Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

That’s a U+201D right double quotation mark.

This is not a valid charset for the charset of course, but it seems like the 
code was intended to handle an invalid charset value without crashing, so it 
should also handle an invalid charset charset (despite the absurdity of the 
entire concept of a charset charset).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-26 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

It is interesting that you get an UnicodeEncodeError when try to decode. Could 
the charser name contain non-ascii characters?

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-03-26 Thread Martin Dengler


Change by Martin Dengler :


--
nosy: +mdengler

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2022-01-12 Thread John Paul Adrian Glaubitz


John Paul Adrian Glaubitz  added the comment:

I'm running into exactly this issue when using 'offlineimap' which is written 
in Python.

--
nosy: +glaubitz

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2021-02-26 Thread Terry J. Reedy


Change by Terry J. Reedy :


--
versions:  -Python 3.6, Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2021-02-26 Thread Terry J. Reedy


Change by Terry J. Reedy :


--
type:  -> behavior

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43323] UnicodeEncodeError: surrogates not allowed when parsing invalid charset

2021-02-25 Thread Anders Kaseorg

New submission from Anders Kaseorg :

We ran into a UnicodeEncodeError exception using email.parser to parse this 
email 
,
 with full headers available in the raw archive 
.  The 
offending header is hilariously invalid:

Content-Type: text/plain; charset*=utf-8”''utf-8%E2%80%9D

but I’m filing an issue since the parser is intended to be robust against 
invalid input.  Minimal reproduction:

>>> import email, email.policy
>>> email.message_from_bytes(b"Content-Type: text/plain; 
>>> charset*=utf-8\xE2\x80\x9D''utf-8%E2%80%9D", policy=email.policy.default)
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/lib/python3.10/email/__init__.py", line 46, in 
message_from_bytes
return BytesParser(*args, **kws).parsebytes(s)
  File "/usr/local/lib/python3.10/email/parser.py", line 123, in parsebytes
return self.parser.parsestr(text, headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 67, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/local/lib/python3.10/email/parser.py", line 57, in parse
return feedparser.close()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 187, in close
self._call_parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 180, in _call_parse
self._parse()
  File "/usr/local/lib/python3.10/email/feedparser.py", line 256, in _parsegen
if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/local/lib/python3.10/email/message.py", line 578, in 
get_content_type
value = self.get('content-type', missing)
  File "/usr/local/lib/python3.10/email/message.py", line 471, in get
return self.policy.header_fetch_parse(k, v)
  File "/usr/local/lib/python3.10/email/policy.py", line 163, in 
header_fetch_parse
return self.header_factory(name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 608, in 
__call__
return self[name](name, value)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 196, in __new__
cls.parse(value, kwds)
  File "/usr/local/lib/python3.10/email/headerregistry.py", line 453, in parse
kwds['decoded'] = str(parse_tree)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in 
__str__
return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 126, in 

return ''.join(str(x) for x in self)
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 798, in 
__str__
for name, value in self.params:
  File "/usr/local/lib/python3.10/email/_header_value_parser.py", line 783, in 
params
value = value.decode(charset, 'surrogateescape')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 5-7: 
surrogates not allowed

--
components: email
messages: 387685
nosy: andersk, barry, r.david.murray
priority: normal
severity: normal
status: open
title: UnicodeEncodeError: surrogates not allowed when parsing invalid charset
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com