[issue28712] Non-Windows mappings for a couple of Windows code pages

2021-03-08 Thread STINNER Victor


Change by STINNER Victor :


--
nosy:  -vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28712] Non-Windows mappings for a couple of Windows code pages

2021-03-04 Thread Larry Hastings


Change by Larry Hastings :


--
nosy:  -larry




[issue28712] Non-Windows mappings for a couple of Windows code pages

2021-03-04 Thread Eryk Sun


Change by Eryk Sun :


--
versions: +Python 3.10, Python 3.8, Python 3.9 -Python 2.7, Python 3.5, Python 
3.6, Python 3.7




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-17 Thread Eryk Sun

Eryk Sun added the comment:

Thanks, Serhiy. When I looked at this previously, I mistakenly assumed that any 
undefined codes would be decoded using the codepage's default Unicode 
character. But for single-byte codepages in the range above 0x9F, Windows 
instead maps undefined codes to the Private Use Area (PUA). For example, using 
decode() from above:

ERROR_NO_UNICODE_TRANSLATION = 0x0459
codepages = 857, 864, 874, 1253, 1255, 1257
for cp in codepages:
    undefined = []
    for i in range(256):
        b = bytes([i])
        try:
            decode(cp, b)
        except OSError as e:
            if e.winerror == ERROR_NO_UNICODE_TRANSLATION:
                c = decode(cp, b, False)
                undefined.append('{:02x}=>{:04x}'.format(ord(b), ord(c)))
    print(cp, *undefined, sep=', ')

output:

857, d5=>f8bb, e7=>f8bc, f2=>f8bd
864, a6=>f8be, a7=>f8bf, ff=>f8c0
874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6, 
fe=>f8c7, ff=>f8c8
1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892, 
df=>f893, fb=>f894, fc=>f895, ff=>f896
1257, a1=>f8fc, a5=>f8fd

Do you think Python's 'replace' handler should prevent adding the 
MB_ERR_INVALID_CHARS flag for PyUnicode_DecodeCodePageStateful? One benefit is 
that the PUA code can be encoded back to the original byte value:

>>> codecs.code_page_encode(1257, '\uf8fd')
(b'\xa5', 1)
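
For illustration, the same PUA mapping could be approximated on the Python side
with a custom decode error handler. This is only a sketch: the handler name
'cp1257_pua' and the table CP1257_PUA are made up for this example, the table
holds just the two cp1257 entries from the output above, and it assumes that
Python's table-based cp1257 codec treats 0xA1 and 0xA5 as undefined.

```python
import codecs

# Hypothetical table: the two cp1257 byte -> PUA mappings reported above.
CP1257_PUA = {0xA1: '\uf8fc', 0xA5: '\uf8fd'}

def cp1257_pua_errors(exc):
    """Decode error handler: map known-undefined bytes to Windows PUA codes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    try:
        replacement = ''.join(CP1257_PUA[b] for b in bad)
    except KeyError:
        # Not a byte listed in the table: keep the strict behaviour.
        raise exc from None
    # Resume decoding right after the offending byte(s).
    return replacement, exc.end

codecs.register_error('cp1257_pua', cp1257_pua_errors)

print(ascii(b'\xa5'.decode('cp1257', 'cp1257_pua')))
```

Bytes that are undefined but not listed in the table still raise
UnicodeDecodeError, so the handler stays strict for everything else.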

> cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.

Windows maps these byte values to PUA codes if the MB_ERR_INVALID_CHARS flag 
isn't used:

>>> decode(932, b'\xa0\xfd\xfe\xff', False)
'\uf8f0\uf8f1\uf8f2\uf8f3'

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-17 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Thank you Eryk. That is what I want. I just missed that code_page_decode() 
returns a tuple.

It seems Windows maps undefined codes to Unicode characters if they are in the 
range 0x80-0x9f and raises an error if they are outside of this range. But if 
the code starts a multibyte sequence, the single byte is an error even if it is 
in the range 0x80-0x9f (codepages 932, 949, 950).

This could be emulated either by decoding with errors='surrogateescape' and 
postprocessing the result (replace '\udc80'-'\udc9f' with '\x80'-'\x9f' and 
handle '\udca0'-'\udcff' as errors) or by writing a custom error handler that 
does the job (though several error handlers might be needed, corresponding to 
'strict', 'replace', 'ignore', etc). Adding a new codec is of course an option 
too.

There are a few other minor differences between Python and Windows:

cp864: On Windows 0x25 is mapped to '%' (U+0025) instead of '٪' (U+066A).
cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
cp1255: 0xCA is mapped to U+05BA instead of being undefined.

The first two differences can be handled by postprocessing; the last one 
requires changing the codec.
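
A minimal sketch of the surrogateescape post-processing idea, assuming a
single-byte code page and Python's bundled table codec. The function name
decode_like_windows is invented for this example, and the error message text
is arbitrary.

```python
def decode_like_windows(data, encoding):
    """Sketch: strict decode of a single-byte code page, Windows-style.

    Undefined bytes 0x80-0x9F pass through as C1 controls; undefined
    bytes 0xA0-0xFF are reported as errors.
    """
    text = data.decode(encoding, errors='surrogateescape')
    out = []
    for pos, ch in enumerate(text):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDC9F:
            # Escaped byte 0x80-0x9F: map to the C1 control of the same value.
            out.append(chr(code - 0xDC00))
        elif 0xDCA0 <= code <= 0xDCFF:
            # pos equals the byte offset only for single-byte code pages.
            raise UnicodeDecodeError(encoding, data, pos, pos + 1,
                                     'undefined byte above 0x9F')
        else:
            out.append(ch)
    return ''.join(out)

print(ascii(decode_like_windows(b'\x81\x8d', 'cp1252')))
```

This covers only the 'strict' case; the 'replace' and 'ignore' variants would
need their own handling of the 0xA0-0xFF branch.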

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread STINNER Victor

STINNER Victor added the comment:

Windows API doc is not easy to understand. I wrote this doc when I fixed
code pages in Python 3:
http://unicodebook.readthedocs.io/operating_systems.html#windows

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Eryk Sun

Eryk Sun added the comment:

The ANSI and OEM codepages are conveniently supported on a Windows system as 
the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by 
the 'replace' error handler (see the encode_code_page_flags function in 
Objects/unicodeobject.c). For other Windows codepages, while it's not as 
convenient, you can use codecs.code_page_encode. For example:

>>> codecs.code_page_encode(1252, 'α', 'replace')
(b'a', 1)
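
For contrast, a small portable example of what Python's own table-based cp1252
codec does with the same input. It has no best-fit data, so the 'replace'
handler falls back to '?'.

```python
# Python's bundled cp1252 codec has no best-fit mapping, so the
# 'replace' handler substitutes '?' rather than 'a'.
text = 'α'  # U+03B1 GREEK SMALL LETTER ALPHA, not present in cp1252
encoded = text.encode('cp1252', 'replace')
print(encoded)
```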

For decoding, MB_ERR_INVALID_CHARS has no effect on decoding single-byte 
codepages because they map every byte. It only affects decoding byte sequences 
that are invalid in multibyte codepages such as 932 and 65001. Without this 
flag, invalid sequences are silently decoded as the codepage's Unicode default 
character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB), 
and for UTF-8 it's U+FFFD. codecs.code_page_decode almost always uses 
MB_ERR_INVALID_CHARS; the exception is UTF-7 (see the decode_code_page_flags 
function). 
So its 'replace' error handling is completely Python's own implementation. For 
example:

MultiByteToWideChar without MB_ERR_INVALID_CHARS:

>>> decode(932, b'\xe05', strict=False)
'\u30fb'

versus code_page_decode:

>>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
('\ufffd5', 2)

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

> Codecs are strict by default in Python. Call MultiByteToWideChar() with the 
> MB_ERR_INVALID_CHARS flag as Python does.

Great catch. Without MB_ERR_INVALID_CHARS or WC_NO_BEST_FIT_CHARS Windows would 
perform the "best fit" behavior described in the BestFit files, which is not 
marked explicitly (they didn't add '<< Best Fit Mapping' like in the readme) in 
these files and requires checking for existence of reverse mapping[1]. When 
MB_ERR_INVALID_CHARS is set, Windows would perform a strict check.
  [1]: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
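
A rough sketch of the reverse-mapping check: an entry in the Unicode-to-byte
table counts as "best fit" when the byte it maps to does not decode back to the
same code point. The function name and the two small dictionaries are
hypothetical and not parsed from the BestFit files; the only grounded entry is
the cp1252 alpha-to-'a' case.

```python
def best_fit_entries(wctable, mbtable):
    """Return Unicode->byte entries that do not round-trip through mbtable."""
    return {u: b for u, b in wctable.items() if mbtable.get(b) != u}

# Toy excerpt (hypothetical data, not parsed from bestfit1252.txt):
mbtable = {0x61: 0x0061}                 # byte -> Unicode
wctable = {0x0061: 0x61, 0x03B1: 0x61}   # Unicode -> byte

# Only U+03B1 is flagged: byte 0x61 decodes to U+0061, not back to U+03B1.
print(best_fit_entries(wctable, mbtable))
```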

By the way, will there be a 'mbcsbestfitreplace' error handler on Windows to 
invoke "best fit" behavior? It might be useful for interoperating with common 
Windows programs and users. (Implementation for other platforms can be 
constructed from WindowsBestFit charts, but it might be too large relative to 
its usefulness.)

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Eryk Sun

Eryk Sun added the comment:

I rewrote it using the csv module since I can't remember the escaping rules.
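
The rewritten script itself is not shown in this message. A rough sketch of
what a csv-based version might look like follows; it uses Python's bundled
'cpNNN' codecs instead of codecs.code_page_decode so that it runs on any
platform, and the function names are invented for this example.

```python
import csv
import io

def codepage_rows(codepages):
    """Yield one row per code page: the number, then 256 decoded cells."""
    for cp in codepages:
        row = [cp]
        for i in range(256):
            try:
                row.append(ascii(bytes([i]).decode('cp%d' % cp)))
            except UnicodeDecodeError:
                row.append('')  # undefined byte: leave the cell empty
        yield row

def table_csv(codepages):
    """Return the whole table as CSV text; csv.writer handles the quoting."""
    buf = io.StringIO()
    csv.writer(buf).writerows(codepage_rows(codepages))
    return buf.getvalue()

print(table_csv([1252, 1253])[:60])
```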

--
Added file: http://bugs.python.org/file45511/codepage_table.csv




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Eryk Sun

Changes by Eryk Sun :


Removed file: http://bugs.python.org/file45510/codepage_table.csv




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Eryk Sun

Eryk Sun added the comment:

I don't think the 2nd tuple element is useful when decoding a single byte. It 
either works or it doesn't, such as failing for non-ASCII bytes with multibyte 
codepages such as 932 and 950. 

I'm attaching the output from the following, which you should be able to open 
in a spreadsheet:

import codecs
codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950,
             1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258]
for cp in codepages:
    table = []
    for i in range(256):
        try:
            c = codecs.code_page_decode(cp, bytes([i]), None, True)
            c = ascii(c[0])
        except Exception:
            c = None
        table.append(c)
    print(cp, *table, sep=',')

--
Added file: http://bugs.python.org/file45510/codepage_table.csv




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This would be helpful too if every byte is decoded to exactly 1 character.

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Eryk Sun

Eryk Sun added the comment:

How about just the ASCII repr of the 256 decoded characters in CSV? I don't 
think the list of 2-tuple results is useful. For these single-byte codepages 
it's always 1 byte consumed.

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Thanks Eryk. Could you please run the following script and attach the output?

import codecs
codepages = [424, 856, 857, 864, 869, 874, 932, 949, 950, 1250, 1251, 1252,
             1253, 1254, 1255, 1257, 1258]
for cp in codepages:
    table = []
    for i in range(256):
        try:
            c = codecs.code_page_decode(cp, bytes([i]), None, True)
        except Exception:
            c = None
        table.append(c)
    print(cp, ascii(table))

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Eryk Sun

Eryk Sun added the comment:

Serhiy, single-byte codepages map every byte value, even if it's just to a 
Unicode C1 control code [1]. 

For example:

import ctypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

MB_ERR_INVALID_CHARS = 0x0008

def mbtwc_errcheck(result, func, args):
    if not result and args[-1]:
        raise ctypes.WinError(ctypes.get_last_error())
    return args

kernel32.MultiByteToWideChar.errcheck = mbtwc_errcheck

def decode(codepage, data, strict=True):
    flags = MB_ERR_INVALID_CHARS if strict else 0
    n = kernel32.MultiByteToWideChar(codepage, flags,
                                     data, len(data),
                                     None, 0)
    buf = (ctypes.c_wchar * n)()
    kernel32.MultiByteToWideChar(codepage, flags,
                                 data, len(data),
                                 buf, n)
    return buf.value


codepages = [437, 874] + list(range(1250, 1259))
for cp in codepages:
    print('cp%d:' % cp, ascii(decode(cp, b'\x81\x8d')))

Output:

cp437: '\xfc\xec'
cp874: '\x81\x8d'
cp1250: '\x81\u0164'
cp1251: '\u0403\u040c'
cp1252: '\x81\x8d'
cp1253: '\x81\x8d'
cp1254: '\x81\x8d'
cp1255: '\x81\x8d'
cp1256: '\u067e\u0686'
cp1257: '\x81\xa8'
cp1258: '\x81\x8d'

[1]: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
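
For comparison, a portable sketch that runs the same two bytes through Python's
bundled table codecs instead of MultiByteToWideChar. The helper name
table_decode is invented here; it simply reports None where the strict table
codec treats a byte as undefined.

```python
def table_decode(cp, data):
    """Decode with Python's bundled 'cpNNN' codec; None if strict decoding fails."""
    try:
        return data.decode('cp%d' % cp)
    except UnicodeDecodeError:
        return None

codepages = [437, 874] + list(range(1250, 1259))
for cp in codepages:
    result = table_decode(cp, b'\x81\x8d')
    print('cp%d:' % cp, ascii(result) if result is not None else 'undefined')
```

Code pages where Windows returns C1 controls (such as cp1252) are the ones
expected to show up as 'undefined' here.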

--
nosy: +eryksun




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Codecs are strict by default in Python. Call MultiByteToWideChar() with the 
MB_ERR_INVALID_CHARS flag as Python does. You also could use 
_codecs.code_page_decode().
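
A short portable illustration of that default, using Python's bundled cp1252
codec (where byte 0x81 is undefined) rather than the Windows API:

```python
data = b'\x81'  # undefined in Python's cp1252 table

try:
    data.decode('cp1252')  # errors='strict' is the default
except UnicodeDecodeError as exc:
    print('strict:', exc.reason)

# Non-strict handlers have to be requested explicitly.
print('replace:', ascii(data.decode('cp1252', 'replace')))
print('ignore:', ascii(data.decode('cp1252', 'ignore')))
```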

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Ned Deily

Ned Deily added the comment:

I'm not qualified to offer a technical opinion on Windows matters like this so, 
for 3.6, I leave it to your discretion, Steve.  If you do decide to push this 
change, please do so before 3.6.0b4 on Monday.

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Steve Dower

Steve Dower added the comment:

No idea which is faster, but the tables have better compatibility.

However, I'm not sure that changing the tables in already released versions is 
a great idea, since it could "corrupt" programs without warning. Adding the 
release managers to weigh in - my gut feel is that targeted table fixes plus 
validation tests are okay for 3.6 if we hurry, but are probably not suitable 
for 2.7 or 3.5.

--
nosy: +benjamin.peterson, larry, ned.deily




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

... On the other hand, I am happy to use these Win32 functions if they are 
faster, but still the table should be made correct in the first place. (See 
also issue28343 (936) and issue28693 (950) for problems with DBCS Chinese code 
pages.)

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

Yes, it's a table issue. My suggested fix is to replace them all with the 
WindowsBestFit tables, which is where MS currently redirects visitors of 
https://msdn.microsoft.com/en-us/globalization/mt767590. These old "WINDOWS" 
tables appear to have been abandoned long ago.

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Steve Dower

Steve Dower added the comment:

So is this a bug in the hardcoded encoding tables in Python? I briefly 
considered making them all use the OS functions, but then they'll be 
inconsistent with other platforms (where the tables should work fine).

Do you have a proposed fix? That will help illustrate where the problem is.

--
versions:  -Python 3.3, Python 3.4




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
components: +Windows
nosy: +paul.moore, steve.dower, tim.golden, zach.ware




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang :


Removed file: http://bugs.python.org/file45502/pycp.py




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

The output is already attached as win10_14959_py36.txt.

PS: after playing with ctypes, I got a version of pycp that works with Py < 3.3 
too (attached with comment).

--
Added file: http://bugs.python.org/file45503/pycp_ctypes.py




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

What is the output of the new script?

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang :


Removed file: http://bugs.python.org/file45497/pycp.py




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang :


Added file: http://bugs.python.org/file45502/pycp.py




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

Ugh... This is weird. Attached is a correct version using Python 3.6's 'code 
page' methods. I have modified the script a little to make sure it runs on Py3.

--
Added file: http://bugs.python.org/file45501/win10_14959_py36.txt




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

It seems to me there is something wrong with your test. For example decoding 
b'\x81\x8d' from CP1251 (as well as from any other codepage!) gives you 
u'\x81\x8d', but codes 0x81 and 0x8D are assigned to different characters: 'Ѓ' 
(U+0403) and 'Ќ' (U+040C).

0x81    0x0403  #CYRILLIC CAPITAL LETTER GJE
0x8D    0x040C  #CYRILLIC CAPITAL LETTER KJE

[1] https://en.wikipedia.org/wiki/Windows-1251
[2] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT
[3] 
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1251.txt

--
nosy: +serhiy.storchaka




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Mingye Wang

Mingye Wang added the comment:

> Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.

... but since Cygwin packagers did not enable Win32 APIs for their build, I 
cannot test the script directly.

--




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Mingye Wang

Changes by Mingye Wang :


Added file: http://bugs.python.org/file45498/windows10_14959.txt




[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Mingye Wang

New submission from Mingye Wang:

Mappings for 0x81 and 0x8D in multiple Windows code pages diverge from what 
Windows does. Attached is a script that tests for this behavior. (These two 
bytes are not necessarily the only problems, but they are certainly the most 
widespread and famous ones. Again, refer to the Unicode best fit tables for 
something that works.)

This problem is seen in Python 2.7.10 on Windows 10 build 14959, but apparently 
it has been known for a long time[1]. Python 3.4.3 on Cygwin also fails 
``b'\x81\x8d'.decode('cp1252')``.
  [1]: https://ftfy.readthedocs.io/en/latest/#module-ftfy.bad_codecs.sloppy

--
components: Unicode
files: pycp.py
messages: 280914
nosy: Artoria2e5, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Non-Windows mappings for a couple of Windows code pages
type: behavior
versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7
Added file: http://bugs.python.org/file45497/pycp.py
