[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Eryk Sun
Eryk Sun added the comment: Rafael, I was discussing code_page_decode() and code_page_encode() both as an alternative for compatibility with other programs and also to explore how MultiByteToWideChar() and WideCharToMultiByte() work -- particularly to explain best-fit mappings, which do not
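
For readers who want to try the functions named above, here is a minimal sketch. It assumes that codecs.code_page_decode() exists only on Windows builds of CPython and takes (code page, data, errors, final) as positional arguments. The helper name native_decode is hypothetical, and on other platforms it simply returns None.

```python
import codecs


def native_decode(data, code_page=1252):
    """Decode bytes through the Windows code page API if it is available.

    Hypothetical helper: returns the decoded string on Windows, or None on
    platforms where codecs.code_page_decode is not provided.
    """
    func = getattr(codecs, "code_page_decode", None)
    if func is None:
        # Not a Windows build; nothing to compare against.
        return None
    # final=True tells the decoder that no more input will follow.
    text, _consumed = func(code_page, data, "strict", True)
    return text


if __name__ == "__main__":
    print(repr(native_decode(b"abc")))
```

Comparing the result with bytes.decode("cp1252") for the same input is one way to see where the Python charmap codec and the native Windows conversion differ.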

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Rafael Belo
Rafael Belo added the comment: Eryk, regarding codecsmodule.c: I don't really know its inner workings or how it is connected to other modules, so changes at that level are not critical for this use case. But it is nice to think and evaluate on that level too, since there

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Eryk Sun
Eryk Sun added the comment:
> From Eryk's description it sounds like we should always add
> WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar()
> in order to make sure it doesn't use best fit variants
> unless explicitly requested.
The concept of a "best fit" encoding is unrelated to

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Just to be clear: The Python code page encodings are (mostly) taken from the unicode.org set of mappings (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/). This is our standards body for such mappings, where possible. In some cases, the Unicode

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Rafael Belo
Rafael Belo added the comment: As encodings are indeed a complex topic, debating this seems like a necessity. I researched this topic when I found an encoding issue regarding a MySQL connector: https://github.com/PyMySQL/mysqlclient/pull/502 In MySQL itself there is a mislabel of "latin1"

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Eryk Sun
Eryk Sun added the comment:
> in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED",
> whereas in bestfit1252, they map to \u0081 \u008d \u008f
> \u0090 \u009d respectively
This is the normal mapping in Windows, not a best-fit encoding. Within Windows, you can access the native
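
The quoted mismatch can be checked from Python itself. The following is a minimal sketch, assuming a CPython whose "cp1252" codec still follows the unicode.org table with "UNDEFINED" entries. The helper names decode_status and undefined_bytes are hypothetical and not part of any patch discussed here.

```python
def decode_status(byte_value, encoding="cp1252"):
    """Return the decoded character for one byte, or None if it is undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        # The codec has no mapping for this byte.
        return None


def undefined_bytes(encoding="cp1252"):
    """List every single-byte value that the codec refuses to decode."""
    return [b for b in range(256) if decode_status(b, encoding) is None]


if __name__ == "__main__":
    # Bytes the Python codec treats as undefined.
    print([hex(b) for b in undefined_bytes("cp1252")])
    # latin-1 maps every byte to the code point of the same value,
    # which matches the \u0081-style mapping quoted above.
    print(repr(decode_status(0x81, "latin-1")))
```

Running this on a given interpreter shows which bytes, if any, the cp1252 codec rejects, next to the one-to-one mapping that latin-1 provides for the same byte.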

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Steve Dower
Steve Dower added the comment: Thanks for the PR. Just wanted to acknowledge that we've seen it. Unfortunately, I'm not feeling confident to take this change right now - encodings are a real minefield, and we need to think through the implications. It's been a while since I've done that,

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-06 Thread Roundup Robot
Change by Roundup Robot:
keywords: +patch
nosy: +python-dev
nosy_count: 8.0 -> 9.0
pull_requests: +26615
stage: -> patch review
pull_request: https://github.com/python/cpython/pull/28189

[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-06 Thread Rafael Belo
New submission from Rafael Belo: There is a mismatch between specification and behavior in some Windows encodings. Some older Windows code page specifications present an "UNDEFINED" mapping, whereas in reality they present another behavior, which is updated in a section named "bestfit". For example