[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Eryk Sun

Eryk Sun  added the comment:

Rafael, I was discussing code_page_decode() and code_page_encode() both as an 
alternative for compatibility with other programs and also to explore how 
MultiByteToWideChar() and WideCharToMultiByte() work -- particularly to explain 
best-fit mappings, which do not roundtrip. MultiByteToWideChar() does not 
exhibit "best fit" behavior. I don't even know what that would mean in the 
context of decoding. 

With the exception of one change to code page 1255, the definitions that you're 
looking to add are just for the C1 controls and private use area codes, which 
are not meaningful. Windows uses these arbitrary definitions to be able to 
roundtrip between the system ANSI and Unicode APIs.

Note that Python's "mbcs" (i.e. "ansi") and "oem" encodings use the code-page 
codec. For example:

>>> _winapi.GetACP()
1252

>>> '\x81\x8d\x8f\x90\x9d'.encode('ansi')
b'\x81\x8d\x8f\x90\x9d'

Best-fit encode "α" in code page 1252 [1]:

>>> 'α'.encode('ansi', 'replace')
b'a'

In your PR, the change to code page 1255 to add b"\xca" <-> "\u05ba" is the 
only change that I think is really worthwhile because the unicode.org data has 
it wrong. You can get the proper character name for the comment using the 
unicodedata module:

>>> print(unicodedata.name('\u05ba'))
HEBREW POINT HOLAM HASER FOR VAV

I'm +0 in favor of leaving the mappings undefined where Windows completes 
legacy single-byte code pages by using C1 control codes and private use area 
codes. It would have been fine if Python's code-page encodings had always been 
based on the "WindowsBestFit" tables, but only the decoding MBTABLE, since it's 
reasonable. 
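The C1 round-trip completion that Windows applies can be sketched cross-platform with a small wrapper around Python's own "cp1252" codec. This is an illustrative model of MultiByteToWideChar()'s behavior, not Python's actual codec; the helper names are made up for the sketch:

```python
# Sketch of how Windows completes cp1252: the five bytes that unicode.org
# leaves UNDEFINED decode to the matching C1 control ordinals, so every
# byte value round-trips between the ANSI and Unicode representations.
UNDEFINED_1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def decode_like_windows_1252(data: bytes) -> str:
    # Fall back to the C1 control ordinal for otherwise-undefined bytes.
    return ''.join(
        chr(b) if b in UNDEFINED_1252 else bytes([b]).decode('cp1252')
        for b in data
    )

def encode_like_windows_1252(text: str) -> bytes:
    return bytes(
        ord(ch) if ord(ch) in UNDEFINED_1252 else ch.encode('cp1252')[0]
        for ch in text
    )

# Every high byte, including the five "undefined" ones, round-trips.
data = bytes(range(0x80, 0xA0))
assert encode_like_windows_1252(decode_like_windows_1252(data)) == data
```

On Windows itself, 'ansi'/'oem' (via the code-page codec) already behave this way, as the REPL session above shows.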

Ideally, I don't want anything to use the best-fit mappings in WCTABLE. I would 
rather that the 'replace' handler for code_page_encode() used the replacement 
character (U+FFFD) or system default character. But the world is not ideal; the 
system ANSI API uses the WCTABLE best-fit encoding. Back in the day with Python 
2.7, it was easy to demonstrate how insidious this is. For example, in 2.7.18:

>>> os.listdir(u'.')
[u'\u03b1']

>>> os.listdir('.')
['a']

---

[1] 
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Rafael Belo


Rafael Belo  added the comment:

Eryk 

Regarding codecsmodule.c, I don't really know its inner workings or how it is 
connected to other modules, so changes at that level are not critical for this 
use case. But it is worth thinking about and evaluating at that level too, 
since that grey zone can lead to some tricky situations on Windows systems. 

My proposal really aims to enhance the Lib/encodings/ module. And as Marc-Andre 
Lemburg advised, those mappings should only change when official corrections 
are made to the standard itself. I now think that really following those 
standards strictly is a good idea. 

On top of that, adding them under different names seems like a better idea 
anyway, since those mappings can be seen as distinct encodings if you take a 
strict look at the Unicode definitions. Adding them would suffice for the needs 
that might arise, would still allow for catching mismatched encodings, and 
could even be "backported" to older Python versions.

I will adjust the PR according to these comments. Thanks for the feedback!

--




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Eryk Sun

Eryk Sun  added the comment:

> From Eryk's description it sounds like we should always add 
> WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar() 
> in order to make sure it doesn't use best fit variants 
> unless explicitly requested.

The concept of a "best fit" encoding is unrelated to decoding with 
MultiByteToWideChar(). By default WideCharToMultiByte() best-fit encodes some 
otherwise unmapped ordinals to characters in the code page that have similar 
glyphs. This doesn't round trip (e.g. "α" -> b"a" -> "a"). The 
WC_NO_BEST_FIT_CHARS flag prevents this behavior. code_page_encode() uses 
WC_NO_BEST_FIT_CHARS for legacy encodings, unless the "replace" error handler 
is used.
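Outside Windows there is no WideCharToMultiByte() to call, but the lossy nature of a best-fit table is easy to sketch with a tiny hypothetical mapping. The single-entry table below is illustrative (modeled on bestfit1252's "α" -> "a" entry), not a complete best-fit table, and the helper name is made up:

```python
# Tiny illustrative best-fit table: an unmapped ordinal is replaced by a
# visually similar character that IS in the target code page.
BEST_FIT = {'\u03b1': b'a'}  # GREEK SMALL LETTER ALPHA -> LATIN 'a'

def best_fit_encode_1252(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode('cp1252')
        except UnicodeEncodeError:
            # Fall back to a best-fit substitute, or '?' like the defaults.
            out += BEST_FIT.get(ch, b'?')
    return bytes(out)

encoded = best_fit_encode_1252('α')   # b'a'
decoded = encoded.decode('cp1252')    # 'a'
assert decoded != 'α'                 # the round trip is lossy
```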

Windows maps every value in single-byte ANSI code pages to a Unicode ordinal, 
which round trips between MultiByteToWideChar() and WideCharToMultiByte(). 
Unless otherwise defined, a value in the range 0x80-0x9F is mapped to the 
corresponding ordinal in the C1 controls block. Otherwise values that have no 
legacy definition are mapped to a private use area (e.g. U+E000 - U+F8FF). 

There is no option to make MultiByteToWideChar() fail for byte values that map 
to a C1 control code. But mappings to the private use area are strictly 
invalid, and MultiByteToWideChar() will fail in these cases if the flag 
MB_ERR_INVALID_CHARS is used. code_page_decode() always uses this flag, but to 
reliably fail one needs to pass final=True, since the codec doesn't know it's a 
single-byte encoding. For example:

>>> codecs.code_page_decode(1253, b'\xaa', 'strict')
('', 0)

>>> codecs.code_page_decode(1253, b'\xaa', 'strict', True)
Traceback (most recent call last):
  File "", line 1, in 
UnicodeDecodeError: 'cp1253' codec can't decode bytes in position 0--1: 
No mapping for the Unicode character exists in the target code page.
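The same final-flag semantics can be seen off Windows with any multibyte codec's stateful decode function; here codecs.utf_8_decode stands in for the code-page codec (an analogy only, not the same code path):

```python
import codecs

# An incomplete multibyte sequence: with final=False (the default) the
# codec assumes more bytes may follow and reports nothing consumed...
assert codecs.utf_8_decode(b'\xe2\x82', 'strict') == ('', 0)

# ...but with final=True it knows the input is complete and fails.
try:
    codecs.utf_8_decode(b'\xe2\x82', 'strict', True)
except UnicodeDecodeError:
    pass
else:
    raise AssertionError('expected UnicodeDecodeError')
```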

Here are the mappings to the private use area in the single-byte "ANSI" code 
pages:

1255 Hebrew
0xD9 -> U+F88D
0xDA -> U+F88E
0xDB -> U+F88F
0xDC -> U+F890
0xDD -> U+F891
0xDE -> U+F892
0xDF -> U+F893
0xFB -> U+F894
0xFC -> U+F895
0xFF -> U+F896

Note that 0xCA is defined as the Hebrew character U+05BA [1]. The definition is 
missing in the unicode.org data and Python's "cp1255" encoding.

874 Thai
0xDB -> U+F8C1
0xDC -> U+F8C2
0xDD -> U+F8C3
0xDE -> U+F8C4
0xFC -> U+F8C5
0xFD -> U+F8C6
0xFE -> U+F8C7
0xFF -> U+F8C8

1253 Greek
0xAA -> U+F8F9
0xD2 -> U+F8FA
0xFF -> U+F8FB

1257 Baltic
0xA1 -> U+F8FC
0xA5 -> U+F8FD

There's no way to get these private use area results from code_page_decode(), 
but code_page_encode() allows them. For example:

>>> codecs.code_page_encode(1253, '\uf8f9')[0]
b'\xaa'

---

[1] https://en.wikipedia.org/wiki/Windows-1255

--




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

Just to be clear: The Python code page encodings are (mostly) taken from the 
unicode.org set of mappings (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/). 
This is our standards body for such mappings, where possible. In some cases, 
the Unicode consortium does not provide such mappings and we resort to other 
standards (ISO, commonly used mapping files in OSes, Wikipedia, etc).

Changes to the existing mapping codecs should only be done in case corrections 
are applied to the mappings under those names by the standard bodies.

If you want to add variants such as the best fit ones from MS, we'd have to add 
them under a different name, e.g. bestfit1252 (see 
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/).

Otherwise, interop with other systems would no longer work.

From Eryk's description it sounds like we should always add 
WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar() in order to make 
sure it doesn't use best fit variants unless explicitly requested.

--




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Rafael Belo

Rafael Belo  added the comment:

As encodings are indeed a complex topic, debating this seems like a necessity. 
I researched this topic when I found an encoding issue regarding a MySQL 
connector: https://github.com/PyMySQL/mysqlclient/pull/502

In MySQL itself there is a mislabeling of "latin1" and "cp1252": what MySQL 
calls "latin1" presents the behavior of cp1252. As Inada Naoki pointed out:

"""
See this: https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html

MySQL's latin1 is the same as the Windows cp1252 character set. This means it 
is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers 
Authority) latin1, except that IANA latin1 treats the code points between 0x80 
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign 
characters for those positions. For example, 0x80 is the Euro sign. For the 
“undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 
0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

So latin1 in MySQL is actually cp1252.
"""

You can verify this by passing the byte 0x80 through each encoding and checking 
the resulting string, a quick test I find useful:

On MySQL: 
select convert(unhex('80') using latin1); -- -> returns "€"

On PostgreSQL: 
select convert_from(E'\\x80'::bytea, 'WIN1252'); -- -> returns "€"
select convert_from(E'\\x80'::bytea, 'LATIN1'); -- -> returns the C1 control 
character U+0080 (UTF-8 bytes "0xc2 0x80")
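The same check is a one-liner with Python's built-in codecs, which mirror the PostgreSQL behavior here:

```python
# cp1252 defines 0x80 as the euro sign; latin-1 (ISO 8859-1) passes it
# straight through to the C1 control character U+0080.
assert b'\x80'.decode('cp1252') == '\u20ac'   # '€'
assert b'\x80'.decode('latin-1') == '\x80'    # C1 control U+0080
```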

I decided to try to fix this behavior in Python because I always found it a 
little odd to receive errors at those code points. A discussion I find 
particularly useful is this one: 
https://comp.lang.python.narkive.com/C9oHYxxu/latin1-and-cp1252-inconsistent

I think the participants there didn't notice the "WindowsBestFit" folder at:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/

Digging through the commits to look for dates, I realized that Amaury Forgeot 
d'Arc created a tool to generate the Windows encodings based on calls to 
"MultiByteToWideChar", which indeed generates the same mappings available on 
the unicode.org website. I've attached the file generated by it. 


Since there might be legacy systems which rely on this specific behavior, I 
don't think backporting this update to older Python versions is a good idea. 
That is the reason I think it should come only in new versions, treated as new 
behavior.

The benefit I see in updating this is to prevent further confusion about the 
expected behavior when dealing with these encodings.

--
Added file: https://bugs.python.org/file50282/cp1252_from_genwincodec.py




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Eryk Sun

Eryk Sun  added the comment:

> in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", 
> whereas in bestfit1252, they map to \u0081 \u008d \u008f 
> \u0090 \u009d respectively

This is the normal mapping in Windows, not a best-fit encoding. Within Windows, 
you can access the native encoding via codecs.code_page_encode() and 
codecs.code_page_decode(). For example:

>>> codecs.code_page_encode(1252, '\x81\x8d\x8f\x90\x9d')[0]
b'\x81\x8d\x8f\x90\x9d'

>>> codecs.code_page_decode(1252, b'\x81\x8d\x8f\x90\x9d')[0]
'\x81\x8d\x8f\x90\x9d'

WinAPI WideCharToMultiByte() uses a best-fit encoding unless the flag 
WC_NO_BEST_FIT_CHARS is passed. For example, with code page 1252, Greek "α" is 
best-fit encoded as Latin b"a". code_page_encode() uses the native best-fit 
encoding when the "replace" error handler is specified. For example:

>>> codecs.code_page_encode(1252, 'α', 'replace')[0]
b'a'

Regarding Python's encodings, if you need a specific mapping to match Windows, 
I think this should be discussed on a case by case basis. I see no benefit to 
supporting a mapping such as "\x81" <-> b"\x81" in code page 1252. That it's 
not mapped in Python is possibly a small benefit, since to some extent this 
helps to catch a mismatched encoding. For example, code page 1251 (Cyrillic) 
maps ordinal b"\x81" to "Ѓ" (i.e. "\u0403").
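That mismatch-catching benefit is easy to demonstrate with the codecs Python ships:

```python
# The same byte decodes differently (or not at all) across ANSI code pages,
# so an unmapped byte acts as a cheap canary for a wrong encoding guess.
assert b'\x81'.decode('cp1251') == '\u0403'   # 'Ѓ' in the Cyrillic code page

try:
    b'\x81'.decode('cp1252')                  # undefined in Python's cp1252
except UnicodeDecodeError:
    pass
else:
    raise AssertionError('expected UnicodeDecodeError')
```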

--




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Steve Dower


Steve Dower  added the comment:

Thanks for the PR.

Just wanted to acknowledge that we've seen it. Unfortunately, I'm not feeling 
confident to take this change right now - encodings are a real minefield, and 
we need to think through the implications. It's been a while since I've done 
that, so could take some time.

Unless, of course, one of the other people who have spent time working on this 
comes in and says they've thought it through and this is the best approach, in 
which case I'll happily trust them :)

--
nosy: +eryksun, serhiy.storchaka




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-06 Thread Roundup Robot


Change by Roundup Robot :


--
keywords: +patch
nosy: +python-dev
nosy_count: 8.0 -> 9.0
pull_requests: +26615
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/28189




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-06 Thread Rafael Belo

New submission from Rafael Belo :

There is a mismatch between specification and behavior in some Windows 
encodings.

Some older Windows code page specifications contain "UNDEFINED" mappings, 
whereas in reality Windows exhibits another behavior, which is documented in a 
separate set of "bestfit" mapping files.

For example, CP1252 has a corresponding bestfit1252: 
CP1252: 
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
bestfit1252: 
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt


From which, in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", 
whereas in bestfit1252 they map to \u0081 \u008d \u008f \u0090 \u009d 
respectively. 

In the Windows API, the function 'MultiByteToWideChar' exhibits the bestfit1252 
behavior.


This issue and its PR propose a correction for this behavior, updating the 
Windows code pages where some code points were defined as "UNDEFINED" to the 
corresponding bestfit mapping. 


Related issue: https://bugs.python.org/issue28712

--
components: Demos and Tools, Library (Lib), Unicode, Windows
messages: 401181
nosy: ezio.melotti, lemburg, paul.moore, rafaelblsilva, steve.dower, 
tim.golden, vstinner, zach.ware
priority: normal
severity: normal
status: open
title: Windows cp encodings "UNDEFINED" entries update
type: behavior
