[issue39910] os.ftruncate on Windows should be sparse
New submission from Mingye Wang : Consider this interaction:

cmd> echo > 1.txt
cmd> python -c "__import__('os').truncate('1.txt', 1024 ** 3)"
cmd> fsutil sparse queryFlag 1.txt

This not only takes a long time, as is typical for a zero-write, but also reports the file as non-sparse, as an actual write would suggest. This is because internally, _chsize_s and friends enlarge files using a loop.[1]

[1]: https://github.com/leelwh/clib/blob/master/c/chsize.c

On Unix systems, ftruncate for enlarging is described as "... as if the extra space is zero-filled", but this is not to be taken literally. In practice, sparse files are used whenever available (GNU dd expects that) and people do expect the operation to be very fast without a lot of real writes. There is a FreeBSD bug report about ftruncate being too slow on UFS.

The aria2 downloader gives a good example of how to truncate into a sparse file on Windows.[2] First an FSCTL_SET_SPARSE control is issued, and then a seek + SetEndOfFile finishes the job. Of course, an lseek to the end is required first to determine the size of the file, so we know whether we are enlarging (sparse) or shrinking (don't sparse).

[2]: https://github.com/aria2/aria2/blob/master/src/AbstractDiskWriter.cc#L507

-- components: Library (Lib) messages: 363717 nosy: Artoria2e5, steve.dower priority: normal severity: normal status: open title: os.ftruncate on Windows should be sparse versions: Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue39910> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
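The steps described in the report (check the current size, mark the file sparse only when enlarging, then seek and set the end of file) could be sketched roughly as below. This is an illustration only, not aria2's or CPython's code: the names truncate_sparse and _win32_extend_sparse are hypothetical, the Windows branch uses ctypes against kernel32 and has not been exercised here, and on other platforms the sketch simply falls back to os.ftruncate.

```python
import os
import sys

FSCTL_SET_SPARSE = 0x000900C4  # control code from the Windows SDK headers
FILE_BEGIN = 0


def _win32_extend_sparse(fd, length):
    """Windows-only helper (hypothetical): mark sparse, seek, SetEndOfFile."""
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.DeviceIoControl.argtypes = [
        wintypes.HANDLE, wintypes.DWORD, wintypes.LPVOID, wintypes.DWORD,
        wintypes.LPVOID, wintypes.DWORD, ctypes.POINTER(wintypes.DWORD),
        wintypes.LPVOID,
    ]
    kernel32.DeviceIoControl.restype = wintypes.BOOL
    kernel32.SetFilePointerEx.argtypes = [
        wintypes.HANDLE, wintypes.LARGE_INTEGER,
        ctypes.POINTER(wintypes.LARGE_INTEGER), wintypes.DWORD,
    ]
    kernel32.SetFilePointerEx.restype = wintypes.BOOL
    kernel32.SetEndOfFile.argtypes = [wintypes.HANDLE]
    kernel32.SetEndOfFile.restype = wintypes.BOOL

    handle = msvcrt.get_osfhandle(fd)
    returned = wintypes.DWORD(0)
    # Best effort: a file system without sparse support just refuses this.
    kernel32.DeviceIoControl(handle, FSCTL_SET_SPARSE, None, 0, None, 0,
                             ctypes.byref(returned), None)
    if not kernel32.SetFilePointerEx(handle, length, None, FILE_BEGIN):
        raise ctypes.WinError(ctypes.get_last_error())
    if not kernel32.SetEndOfFile(handle):
        raise ctypes.WinError(ctypes.get_last_error())


def truncate_sparse(path, length):
    """Resize path to length bytes, keeping a newly added tail sparse if possible."""
    with open(path, "r+b") as f:
        fd = f.fileno()
        current = os.fstat(fd).st_size
        if sys.platform == "win32" and length > current:
            _win32_extend_sparse(fd, length)   # enlarging: go sparse
        else:
            os.ftruncate(fd, length)           # shrinking, or not Windows
```

On Windows, running `fsutil sparse queryFlag` on the resulting file would be one way to check whether the flag was actually set.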
[issue39732] plistlib should export UIDs in XML like Apple does
Change by Mingye Wang : -- keywords: +patch pull_requests: +17987 stage: -> patch review pull_request: https://github.com/python/cpython/pull/18622 ___ Python tracker <https://bugs.python.org/issue39732> ___
[issue39732] plistlib should export UIDs in XML like Apple does
New submission from Mingye Wang : Although there is no native UID type in Apple's XML format, Apple's NSKeyedArchiver still works with it because it converts the UID to a dict of {"CF$UID": int(some_uint64_val)}. plistlib should do the same. For a sample, see https://github.com/apple/swift-corelibs-foundation/blob/2a5bc4d8a0b073532e60410682f5eb8f00144870/Tests/Foundation/Resources/NSKeyedUnarchiver-ArrayTest.plist. -- components: Library (Lib) messages: 362513 nosy: Artoria2e5 priority: normal severity: normal status: open title: plistlib should export UIDs in XML like Apple does type: behavior versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue39732> ___
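Until plistlib does this itself, the conversion described above can be approximated in user code. The following is a minimal sketch, assuming Python 3.8+ (where plistlib.UID exists); the helper name uids_to_cf_dicts and the sample archive are made up for illustration and are not part of plistlib or of the linked pull request.

```python
import plistlib


def uids_to_cf_dicts(value):
    """Recursively replace plistlib.UID objects with {"CF$UID": int} dicts."""
    if isinstance(value, plistlib.UID):
        return {"CF$UID": value.data}
    if isinstance(value, dict):
        return {key: uids_to_cf_dicts(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [uids_to_cf_dicts(item) for item in value]
    return value


# A tiny, made-up structure in the shape of a keyed archive.
archive = {"$top": {"root": plistlib.UID(1)}, "$objects": ["$null", "hello"]}

# After the conversion only plain dicts, lists, strings and ints remain,
# so the XML writer has nothing UID-specific left to handle.
xml_bytes = plistlib.dumps(uids_to_cf_dicts(archive), fmt=plistlib.FMT_XML)
```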
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
Mingye Wang added the comment: b'\x80'.decode('cp936') is still broken on Python 3.7. Working on a PR. -- versions: +Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue28343> ___
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
Changes by Mingye Wang <arthur200...@gmail.com>: -- versions: -Python 3.3, Python 3.4 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28343> ___
[issue28693] No EUDC (HKSCS) support in Windows cp950
Mingye Wang added the comment: Windows cp950's EUDC<->PUA mapping is not specific to HKSCS. -- title: No HKSCS support in Windows cp950 -> No EUDC (HKSCS) support in Windows cp950 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28693> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Mingye Wang added the comment:

> Codecs are strict by default in Python. Call MultiByteToWideChar() with the
> MB_ERR_INVALID_CHARS flag as Python does.

Great catch. Without MB_ERR_INVALID_CHARS or WC_NO_BEST_FIT_CHARS, Windows performs the "best fit" behavior described in the BestFit files. That behavior is not marked explicitly in these files (they didn't add '<< Best Fit Mapping' like in the readme), so detecting it requires checking for the existence of a reverse mapping.[1] When MB_ERR_INVALID_CHARS is set, Windows performs a strict check.

[1]: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

By the way, will there be a 'mbcsbestfitreplace' error handler on Windows to invoke the "best fit" behavior? It might be useful for interoperating with common Windows programs and users. (An implementation for other platforms could be constructed from the WindowsBestFit charts, but it might be too large relative to its usefulness.)

-- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Mingye Wang added the comment: ... On the other hand, I am happy to use these Win32 functions if they are faster, but still the table should be made correct in the first place. (See also issue28343 (936) and issue28693 (950) for problems with DBCS Chinese code pages.) -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
Mingye Wang added the comment: Update: the test script at issue28712 can be modified to show this issue too. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28343> ___
[issue28693] No HKSCS support in Windows cp950
Mingye Wang added the comment: Update: the test script at issue28712 can be modified to show this issue too. -- components: +Windows nosy: +paul.moore, steve.dower, tim.golden, zach.ware ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28693> ___
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
Changes by Mingye Wang <arthur200...@gmail.com>: -- components: +Windows nosy: +paul.moore, steve.dower, tim.golden, zach.ware ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28343> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Mingye Wang added the comment: Yes, it's a table issue. My suggested fix is to replace them all with the WindowsBestFit tables, to which MS currently redirects visitors of https://msdn.microsoft.com/en-us/globalization/mt767590. These old "WINDOWS" tables appear to have been abandoned long ago. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Changes by Mingye Wang <arthur200...@gmail.com>: Removed file: http://bugs.python.org/file45502/pycp.py ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Mingye Wang added the comment: The output is already attached as win10_14959_py36.txt. PS: after playing with ctypes, I got a version of pycp that works with Py < 3.3 too (attached with comment). -- Added file: http://bugs.python.org/file45503/pycp_ctypes.py ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Changes by Mingye Wang <arthur200...@gmail.com>: Removed file: http://bugs.python.org/file45497/pycp.py ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Changes by Mingye Wang <arthur200...@gmail.com>: Added file: http://bugs.python.org/file45502/pycp.py ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Mingye Wang added the comment: Ugh... This is weird. Attached is a correct version that uses Python 3.6's 'code page' methods. I have modified the script a little to make sure it runs on Py3. -- Added file: http://bugs.python.org/file45501/win10_14959_py36.txt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Mingye Wang added the comment: > Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``. ... but since Cygwin packagers did not enable Win32 APIs for their build, I cannot test the script directly. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
Changes by Mingye Wang <arthur200...@gmail.com>: Added file: http://bugs.python.org/file45498/windows10_14959.txt ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
[issue28712] Non-Windows mappings for a couple of Windows code pages
New submission from Mingye Wang: Mappings for 0x81 and 0x8D in multiple Windows code pages diverge from what Windows does. Attached is a script that tests for this behavior. (These two bytes are not necessarily the only problems, but they are certainly the most widespread and famous ones. Again, refer to the Unicode best fit tables for something that works.)

This problem is seen in Python 2.7.10 on Windows 10 build 14959, but apparently it has been known for a long time.[1] Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.

[1]: https://ftfy.readthedocs.io/en/latest/#module-ftfy.bad_codecs.sloppy

-- components: Unicode files: pycp.py messages: 280914 nosy: Artoria2e5, ezio.melotti, haypo priority: normal severity: normal status: open title: Non-Windows mappings for a couple of Windows code pages type: behavior versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7 Added file: http://bugs.python.org/file45497/pycp.py ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28712> ___
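To make the divergence concrete, here is a small sketch of a workaround in pure Python. It assumes, as the report and the ftfy "sloppy" codecs suggest, that Windows maps the bytes left undefined in cp1252 to the C1 control characters with the same numeric value. The error handler name "c1fallback" and the function c1_fallback are invented for this example and are not part of the codecs module.

```python
import codecs


def c1_fallback(exc):
    """Map each undecodable byte to the code point with the same value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


codecs.register_error("c1fallback", c1_fallback)

# The strict codec is expected to reject these bytes ...
try:
    b"\x81\x8d".decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ... while the fallback handler yields U+0081 and U+008D.
sloppy = b"\x81\x8d".decode("cp1252", errors="c1fallback")
```

Bytes that cp1252 does define, such as 0x80 for the euro sign, never reach the handler and decode as usual.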
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
Mingye Wang added the comment: Also, go to ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt for MS reference. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28343> ___
[issue24117] Wrong range checking in GB18030 decoder.
Mingye Wang added the comment: Just FYI, cp950 0xC6A1 (\uf6b1) is found in current WindowsBestFit: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt -- nosy: +Artoria2e5 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue24117> ___
[issue28693] No HKSCS support in Windows cp950
New submission from Mingye Wang: Python's cp950 implementation lacks support for HKSCS ('big5hkscs'). This support, which maps HKSCS Big5-EUDC code points to Unicode PUA code points algorithmically, is found in Windows Vista+ as well as in an update for XP.

An experiment session is shown below. I will use '2>>>' to denote a Win32 build of Python 2.7.10 running under a console window set to cp950 (via chcp), and '3>>>' to denote a Python 3.4.3 build running under Cygwin's UTF-8 mintty.

HKSCS-2008's table (http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt) is used for a list of HKSCS characters; note, though, that its non-PUA mappings are not found in Windows. Let's start with the first character in that list.

3>>> u'\u43F0'
'䏰'
3>>> print(u'\uF266') # provisional PUA

3>>> u'\u43F0'.encode('cp950') # FAIL
3>>> u'\uF266'.encode('cp950') # FAIL
3>>> u'\u43F0'.encode('hkscs')
b'\x87@'
3>>> u'\uF266'.encode('hkscs') # FAIL

The experiments above show how Python 3 handles HKSCS characters, and how U+43F0 should normally be encoded. Now let's switch to the Windows console, which uses Windows' decode-to-Unicode routine for cp950.

2>>> print b'\x87@'

Let's try to identify this character:

3>>> u''
'\uf266'

So indeed there is some sort of HKSCS going on. But note that what Windows has is really not any kind of new HKSCS:

> Big5  ucs93  ucs00  ucs03 + 1-6
> 876B  9734   9734   9734
> 876C  F292   F292   27BEF
> 876D  5BDB   5BDB   5BDB

2>>> print b'\x87\x6b,\x87\x6c,\x87\x6d'
,,
3>>> u',,'
'\uf291,\uf292,\uf293'

Just as for all other code pages, you can always find Microsoft's mapping at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt. If you are uncomfortable with adding a whole new table and wasting space (this is done for hkscs btw), use the algorithmic mapping at https://en.wikipedia.org/wiki/Code_page_950.
-- components: Unicode messages: 280811 nosy: Artoria2e5, ezio.melotti, haypo priority: normal severity: normal status: open title: No HKSCS support in Windows cp950 type: behavior versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28693> ___
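The algorithmic EUDC mapping mentioned in the report can be sketched in a few lines. The range table below is an assumption: it is written from memory of the Wikipedia page linked above, and only the values that appear in this thread (0x8740, 0x876B to 0x876D, and 0xC6A1) have been checked against it. The name eudc_to_pua is hypothetical. Each Big5 row is taken to hold 157 cells: trail bytes 0x40 to 0x7E followed by 0xA1 to 0xFE.

```python
# (first lead byte, last lead byte, code point assigned to that block's
# first row at trail byte 0x40). Assumed values, see the note above.
_EUDC_RANGES = [
    (0x81, 0x8D, 0xEEB8),
    (0x8E, 0xA0, 0xE311),
    (0xC6, 0xC8, 0xF672),  # block really starts at 0xC6A1 (U+F6B1)
    (0xFA, 0xFE, 0xE000),
]
_CELLS_PER_ROW = 157  # 63 low trail bytes + 94 high trail bytes


def eudc_to_pua(pair):
    """Map a two-byte cp950 EUDC sequence to its private-use character."""
    lead, trail = pair
    if 0x40 <= trail <= 0x7E:
        column = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        column = trail - 0xA1 + 63
    else:
        raise ValueError("not a Big5 trail byte: %#x" % trail)
    if lead == 0xC6 and trail < 0xA1:
        raise ValueError("0xC640-0xC67E is outside the EUDC area")
    for first, last, base in _EUDC_RANGES:
        if first <= lead <= last:
            return chr(base + (lead - first) * _CELLS_PER_ROW + column)
    raise ValueError("not an EUDC lead byte: %#x" % lead)
```

For instance, eudc_to_pua(b'\x87@') gives U+F266, the value the Windows console produced in the session above.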
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
Mingye Wang added the comment: The "join the web people" solution should look like this:

$ diff -Naurp a/_codecs_cn.c b/_codecs_cn.c
--- a/_codecs_cn.c	2016-10-09 14:24:04.675111500 -0700
+++ b/_codecs_cn.c	2016-10-09 14:27:06.600961500 -0700
@@ -128,6 +128,12 @@ ENCODER(gbk)
             continue;
         }
 
+        if (c == 0x20AC) { /* cp936, or web GBK */
+            WRITEBYTE1((unsigned char)0x80);
+            NEXT(1, 1);
+            continue;
+        }
+
         if (c > 0x)
             return 1;
 
@@ -159,6 +165,12 @@ DECODER(gbk)
             NEXT_IN(1);
             continue;
         }
+
+        if (c == 0x80) { /* cp936, or web GBK */
+            OUTCHAR(0x20AC);
+            NEXT_IN(1);
+            continue;
+        }
 
         REQUIRE_INBUF(2);

It should be mostly safe, as this character was not previously defined in Python's GBK implementation.

-- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28343> ___
[issue24036] GB2312 codec is using a wrong covert table
Mingye Wang added the comment:

> Advice for final user:

This seems worth adding to the codecs doc as a footnote. Perhaps something like "(deprecated) ... gb2312 is an obsolete encoding from the 1980s. Use gbk or gb18030 instead." will do.

> libiconv-1.14 is also using the wrong version.

Just a side note on the rightness or wrongness of libiconv: I have reported the GB18030 incompatibility as a libiconv bug.[1] From the replies, I learnt that 1) what libiconv currently uses is a then-official mapping published on ftp.unicode.org; 2) vendor implementations of gb2312 differed historically. I have updated the corresponding section[2] on Wikipedia to include these old references.

[1]: https://lists.gnu.org/archive/html/bug-gnu-libiconv/2016-09/msg4.html
[2]: https://en.wikipedia.org/wiki/GB_2312#Two_implementations_of_GB2312

Still, being old and common does not necessarily mean being correct, as Ma Lin has demonstrated by showing the character semantics. To reflect this in a better-supported manner, I have added names for the glyphs in question from GB2312-80 to [2].

-- nosy: +Artoria2e5 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue24036> ___
[issue28343] Bad encoding alias cp936 -> gbk: euro sign
New submission from Mingye Wang (Arthur2e5): Microsoft's cp936 defines a euro sign at 0x80, but Python would kick the bucket when asked to do something like `u'\u20ac'.encode('cp936')`. This may break things for zh-hans-cn Windows users who want to put a euro sign in their file names (if they insist on using a non-Unicode str for open() in py2, well).

By looking at the codecs documentation, 'cp936' appears to be an alias for the GBK encoder. "GBK" is itself a very ambiguous name and subject to confusion: it might refer to any of the four commonly known members of the family of EUC-CN (gb2312) extensions that have full coverage of the Unicode 1.1 CJK Unified Ideographs block:

1) The original GBK. Rust-Encoding says that it's in a normative annex of GB13000.1-1993, but the closest thing I can find in my archive.org copy of that standard is an annex on an EUC (GB/T 2311) UCS.
2) IANA GBK, or Microsoft cp936. This is the one with the euro sign I am looking for.
3) GBK 1.0, a recommendation from the official standardization committees based on cp936. It's roughly cp936 without the euro sign but with some additional 95 PUA code points.
4) W3C TR GBK. This GBK is basically gb18030-2005 without four-byte UTF, and with the euro sign. Roughly a union of 2) and 3) with some PUA code points moved into the right place.

Looking at Modules/cjkcodecs/_codecs_cn.c @ 104259:36b052adf5a7, Python seems to be doing either 1) or 3). For a quick fix you can just make an additional cp936 encoding around the gbk encoding that handles U+20AC; for some excitement (of potentially breaking stuff) you can join the web people and use either 2) or 4).
-- components: Unicode messages: 277925 nosy: Mingye Wang (Arthur2e5), ezio.melotti, haypo priority: normal severity: normal status: open title: Bad encoding alias cp936 -> gbk: euro sign type: behavior versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue28343> ___
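The "quick fix" from the report, a cp936 layer around the existing gbk codec that handles U+20AC, could look roughly like the sketch below. The function names encode_cp936 and decode_cp936 are hypothetical, only strict error handling is covered, and nothing here is registered as a real codec. Note that 0x80 is also a valid GBK trail byte, so the decoder has to walk the byte string rather than split on b'\x80'.

```python
EURO = "\u20ac"


def encode_cp936(text):
    """Encode with gbk, but emit the single byte 0x80 for the euro sign."""
    return b"\x80".join(part.encode("gbk") for part in text.split(EURO))


def decode_cp936(data):
    """Decode with gbk, treating a 0x80 in lead-byte position as the euro sign."""
    pieces = []
    start = i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x80:
            # Not preceded by a lead byte, so this is the cp936 euro sign.
            pieces.append(data[start:i].decode("gbk"))
            pieces.append(EURO)
            i += 1
            start = i
        elif byte > 0x80:
            i += 2  # lead byte: skip it together with its trail byte
        else:
            i += 1  # plain ASCII
    pieces.append(data[start:].decode("gbk"))
    return "".join(pieces)
```

A call such as encode_cp936('\u20ac100') would then produce b'\x80100', and decode_cp936 reverses it.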