[issue39910] os.ftruncate on Windows should be sparse

2020-03-09 Thread Mingye Wang


New submission from Mingye Wang :

Consider this interaction:

cmd> echo > 1.txt
cmd> python -c "__import__('os').truncate('1.txt', 1024 ** 3)"
cmd> fsutil sparse queryFlag 1.txt

The truncation not only takes a long time, as is typical for a zero-write, but 
fsutil also reports the file as non-sparse, as an actual write would suggest. 
This is because internally, _chsize_s and friends enlarge files by writing 
zeros in a loop.[1]
  [1]: https://github.com/leelwh/clib/blob/master/c/chsize.c

On Unix systems, ftruncate for enlarging is described as "... as if the extra 
space is zero-filled", but this is not to be taken literally. In practice, 
sparse files are used whenever available (GNU dd expects that) and people do 
expect the operation to be very fast without a lot of real writes. A FreeBSD 
bug exists around how ftruncate is too slow on UFS.
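
One way to check the "no real writes" expectation on a given system is to 
compare a file's apparent size with the space actually allocated for it. The 
helper name grow_and_measure below is made up for this illustration, and the 
result depends on the file system in use, so treat it as a rough probe rather 
than a definitive test.

```python
import os
import tempfile


def grow_and_measure(target_size):
    """Enlarge an empty temporary file with os.truncate() and return
    (apparent size, allocated bytes or None if the platform hides it)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        os.truncate(path, target_size)
        st = os.stat(path)
        # st_blocks counts 512-byte units on POSIX; Windows does not provide it.
        blocks = getattr(st, "st_blocks", None)
        allocated = None if blocks is None else blocks * 512
        return st.st_size, allocated
    finally:
        os.remove(path)


if __name__ == "__main__":
    size, allocated = grow_and_measure(64 * 1024 * 1024)
    print("apparent size:", size)
    print("allocated bytes:", allocated)
```

On a file system with sparse file support, the allocated figure is expected to 
stay far below the apparent size.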

The aria2 downloader gives a good example of how to truncate into a sparse file 
on Windows.[2] First an FSCTL_SET_SPARSE control is issued, and then a seek 
followed by SetEndOfFile finishes the job. Of course, an lseek to the end is 
required first to determine the size of the file, so that we know whether we 
are enlarging (make it sparse) or shrinking (don't).
  [2]: https://github.com/aria2/aria2/blob/master/src/AbstractDiskWriter.cc#L507
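
The sequence described above can be sketched from Python with ctypes. This is 
only an illustration of the idea, not what CPython or aria2 actually does: the 
function name sparse_ftruncate is hypothetical, the choice to ignore a failed 
FSCTL_SET_SPARSE is an assumption, and on other platforms the sketch simply 
falls back to os.ftruncate.

```python
import os
import sys
import tempfile

FSCTL_SET_SPARSE = 0x000900C4  # control code defined in winioctl.h


def sparse_ftruncate(fd, length):
    """Resize the file behind fd; on Windows, mark it sparse before growing."""
    if sys.platform != "win32":
        os.ftruncate(fd, length)  # POSIX file systems usually leave a hole
        return

    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.DeviceIoControl.argtypes = [
        wintypes.HANDLE, wintypes.DWORD, wintypes.LPVOID, wintypes.DWORD,
        wintypes.LPVOID, wintypes.DWORD, ctypes.POINTER(wintypes.DWORD),
        wintypes.LPVOID,
    ]
    kernel32.DeviceIoControl.restype = wintypes.BOOL
    kernel32.SetEndOfFile.argtypes = [wintypes.HANDLE]
    kernel32.SetEndOfFile.restype = wintypes.BOOL

    handle = msvcrt.get_osfhandle(fd)
    saved = os.lseek(fd, 0, os.SEEK_CUR)
    current_size = os.lseek(fd, 0, os.SEEK_END)
    try:
        if length > current_size:
            # Enlarging: request the sparse attribute first. A failure (for
            # example on a file system without sparse support) is ignored.
            returned = wintypes.DWORD(0)
            kernel32.DeviceIoControl(handle, FSCTL_SET_SPARSE, None, 0,
                                     None, 0, ctypes.byref(returned), None)
        # Move the file pointer to the new end and cut or extend there.
        os.lseek(fd, length, os.SEEK_SET)
        if not kernel32.SetEndOfFile(handle):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        os.lseek(fd, saved, os.SEEK_SET)  # keep the caller's file offset


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        sparse_ftruncate(f.fileno(), 1024 ** 3)
        print(os.fstat(f.fileno()).st_size)
```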

--
components: Library (Lib)
messages: 363717
nosy: Artoria2e5, steve.dower
priority: normal
severity: normal
status: open
title: os.ftruncate on Windows should be sparse
versions: Python 3.8, Python 3.9

___
Python tracker 
<https://bugs.python.org/issue39910>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39732] plistlib should export UIDs in XML like Apple does

2020-02-23 Thread Mingye Wang


Change by Mingye Wang :


--
keywords: +patch
pull_requests: +17987
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/18622

___
Python tracker 
<https://bugs.python.org/issue39732>
___



[issue39732] plistlib should export UIDs in XML like Apple does

2020-02-23 Thread Mingye Wang


New submission from Mingye Wang :

Although there is no native UID type in Apple's XML format, Apple's 
NSKeyedArchiver still works with it because it converts the UID to a dict of 
{"CF$UID": int(some_uint64_val)}. Plistlib should do the same.

For a sample, see 
https://github.com/apple/swift-corelibs-foundation/blob/2a5bc4d8a0b073532e60410682f5eb8f00144870/Tests/Foundation/Resources/NSKeyedUnarchiver-ArrayTest.plist.
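
Until plistlib does this itself, a caller can apply the same conversion before 
dumping. The sketch below assumes Python 3.8+ (where plistlib.UID exists); the 
helper name uids_to_dicts and the sample archive are made up for illustration.

```python
import plistlib


def uids_to_dicts(value):
    """Recursively replace plistlib.UID objects with {"CF$UID": int} dicts."""
    if isinstance(value, plistlib.UID):
        return {"CF$UID": value.data}
    if isinstance(value, dict):
        return {key: uids_to_dicts(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [uids_to_dicts(item) for item in value]
    return value


# Hypothetical fragment shaped like an NSKeyedArchiver plist.
archive = {"$top": {"root": plistlib.UID(1)}}
xml = plistlib.dumps(uids_to_dicts(archive), fmt=plistlib.FMT_XML)
print(xml.decode("utf-8"))
```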

--
components: Library (Lib)
messages: 362513
nosy: Artoria2e5
priority: normal
severity: normal
status: open
title: plistlib should export UIDs in XML like Apple does
type: behavior
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9

___
Python tracker 
<https://bugs.python.org/issue39732>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2020-01-05 Thread Mingye Wang


Mingye Wang  added the comment:

b'\x80'.decode('cp936') is still broken on Python 3.7. Working on a PR.

--
versions: +Python 3.8, Python 3.9

___
Python tracker 
<https://bugs.python.org/issue28343>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2016-11-24 Thread Mingye Wang

Changes by Mingye Wang <arthur200...@gmail.com>:


--
versions:  -Python 3.3, Python 3.4

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28343>
___



[issue28693] No EUDC (HKSCS) support in Windows cp950

2016-11-24 Thread Mingye Wang

Mingye Wang added the comment:

Windows cp950's EUDC<->PUA mapping is not specific to HKSCS.

--
title: No HKSCS support in Windows cp950 -> No EUDC (HKSCS)  support in Windows 
cp950

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28693>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

> Codecs are strict by default in Python. Call MultiByteToWideChar() with the 
> MB_ERR_INVALID_CHARS flag as Python does.

Great catch. Without MB_ERR_INVALID_CHARS or WC_NO_BEST_FIT_CHARS, Windows 
performs the "best fit" behavior described in the BestFit files. Best-fit 
entries are not marked explicitly in these files (they didn't add 
'<< Best Fit Mapping' like in the readme), so identifying one requires checking 
whether a reverse mapping exists.[1] When MB_ERR_INVALID_CHARS is set, Windows 
performs a strict check.
  [1]: 
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

By the way, will there be a 'mbcsbestfitreplace' error handler on Windows to 
invoke "best fit" behavior? It might be useful for interoperating with common 
Windows programs and users. (Implementation for other platforms can be 
constructed from WindowsBestFit charts, but it might be too large relative to 
its usefulness.)

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

... On the other hand, I am happy to use these Win32 functions if they are 
faster, but still the table should be made correct in the first place. (See 
also issue28343 (936) and issue28693 (950) for problems with DBCS Chinese code 
pages.)

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

Update: the test script at issue28712 can be modified to show this issue too.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28343>
___



[issue28693] No HKSCS support in Windows cp950

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

Update: the test script at issue28712 can be modified to show this issue too.

--
components: +Windows
nosy: +paul.moore, steve.dower, tim.golden, zach.ware

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28693>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang <arthur200...@gmail.com>:


--
components: +Windows
nosy: +paul.moore, steve.dower, tim.golden, zach.ware

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28343>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

Yes, it's a table issue. My suggested fix is to replace them all with the 
WindowsBestFit tables, which is where MS currently redirects visitors of 
https://msdn.microsoft.com/en-us/globalization/mt767590. These old "WINDOWS" 
tables appear to have been abandoned long ago.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang <arthur200...@gmail.com>:


Removed file: http://bugs.python.org/file45502/pycp.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

The output is already attached as win10_14959_py36.txt.

PS: after playing with ctypes, I got a version of pycp that works with Py < 3.3 
too (attached with comment).

--
Added file: http://bugs.python.org/file45503/pycp_ctypes.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang <arthur200...@gmail.com>:


Removed file: http://bugs.python.org/file45497/pycp.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Changes by Mingye Wang <arthur200...@gmail.com>:


Added file: http://bugs.python.org/file45502/pycp.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-16 Thread Mingye Wang

Mingye Wang added the comment:

Ugh... This is weird. Attached is a correct version that uses Python 3.6's 
'code page' methods. I have modified the script a little to make sure it runs 
on Py3.

--
Added file: http://bugs.python.org/file45501/win10_14959_py36.txt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Mingye Wang

Mingye Wang added the comment:

> Python 3.4.3 on Cygwin also fails ``b'\x81\x8d'.decode('cp1252')``.

... but since Cygwin packagers did not enable Win32 APIs for their build, I 
cannot test the script directly.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Mingye Wang

Changes by Mingye Wang <arthur200...@gmail.com>:


Added file: http://bugs.python.org/file45498/windows10_14959.txt

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28712] Non-Windows mappings for a couple of Windows code pages

2016-11-15 Thread Mingye Wang

New submission from Mingye Wang:

Mappings for 0x81 and 0x8D in multiple Windows code pages diverge from what 
Windows does. Attached is a script that tests for this behavior. (These two 
bytes are not necessarily the only problems, but they are certainly the most 
widespread and famous ones. Again, refer to the Unicode best-fit tables for 
something that works.)

This problem is seen in Python 2.7.10 on Windows 10b14959, but apparently it 
has been known for a long time[1]. Python 3.4.3 on Cygwin also fails 
``b'\x81\x8d'.decode('cp1252')``.
  [1]: https://ftfy.readthedocs.io/en/latest/#module-ftfy.bad_codecs.sloppy
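
As a user-side workaround sketch (not the table fix proposed here), a codec 
error handler can pass the undefined bytes through as the same-valued code 
points, which is believed to be what the WindowsBestFit tables and ftfy's 
"sloppy" codecs do for these positions. The handler name "c1fallback" and the 
function decode_like_windows are hypothetical.

```python
import codecs


def c1_fallback(exc):
    """Decode bytes the codec leaves undefined as same-valued code points."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(byte) for byte in bad), exc.end


codecs.register_error("c1fallback", c1_fallback)


def decode_like_windows(data, encoding="cp1252"):
    """Decode with Python's table, falling back to U+0080..U+009F for gaps."""
    return data.decode(encoding, errors="c1fallback")


print(ascii(decode_like_windows(b"\x81\x8d")))
```

Bytes that Python's table does define, such as 0x80 for the euro sign in 
cp1252, are unaffected because the handler only runs on decoding errors.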

--
components: Unicode
files: pycp.py
messages: 280914
nosy: Artoria2e5, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Non-Windows mappings for a couple of Windows code pages
type: behavior
versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7
Added file: http://bugs.python.org/file45497/pycp.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2016-11-15 Thread Mingye Wang

Mingye Wang added the comment:

Also, go to 
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt
 for MS reference.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28343>
___



[issue24117] Wrong range checking in GB18030 decoder.

2016-11-14 Thread Mingye Wang

Mingye Wang added the comment:

Just FYI, cp950 0xC6A1 (\uf6b1) is found in current WindowsBestFit: 
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt

--
nosy: +Artoria2e5

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24117>
___



[issue28693] No HKSCS support in Windows cp950

2016-11-14 Thread Mingye Wang

New submission from Mingye Wang:

Python's cp950 implementation lacks support for HKSCS ('big5hkscs'). This 
support, which maps HKSCS Big5-EUDC code points to Unicode PUA code points 
algorithmically, is found in Windows Vista+ as well as an update for XP.

An experiment session is shown below. I will use '2>>>' to denote a Win32 build 
of Python 2.7.10 running under a console window set to cp950 (via chcp), and 
'3>>>' to denote a Python 3.4.3 build running under Cygwin's UTF-8 mintty. 
The HKSCS-2008 table at 
http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt
 is used as the list of HKSCS characters; note, though, that its non-PUA 
mappings are not found in Windows.

Let's start with the first character in that list.

3>>> u'\u43F0'
'䏰'
3>>> print(u'\uF266') # provisional PUA

3>>> u'\u43F0'.encode('cp950') # FAIL
3>>> u'\uF266'.encode('cp950') # FAIL
3>>> u'\u43F0'.encode('hkscs')
b'\x87@'
3>>> u'\uF266'.encode('hkscs') # FAIL

These experiments above show how Python 3 handles HKSCS characters, and how 
U+43F0 should normally be encoded. Now let's switch to Windows console, which 
would be using Windows' decode-to-Unicode routine for cp950.

2>>> print b'\x87@'


Let's try to identify this character:

3>>> u''
'\uf266'

So indeed there is some sort of HKSCS going on. But note what Windows has is 
really not any kind of new HKSCS:

> Big5   ucs93  ucs00  ucs03 + 1-6
> 876B   9734   9734   9734
> 876C   F292   F292   27BEF
> 876D   5BDB   5BDB   5BDB

2>>> print b'\x87\x6b,\x87\x6c,\x87\x6d'
,,
3>>> u',,'
'\uf291,\uf292,\uf293'

Just as for all other code pages, you can always find Microsoft's mapping at 
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt.
 If you are uncomfortable with adding a whole new table and wasting space (this 
is done for hkscs btw), use the algorithmic mapping at 
https://en.wikipedia.org/wiki/Code_page_950.
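
A minimal sketch of such an algorithmic mapping follows. The four EUDC ranges 
and their PUA bases are an assumption taken from memory of the Wikipedia 
article rather than from this report, although they do reproduce the values 
observed in the session above (0x8740 -> U+F266, 0x876B..0x876D -> 
U+F291..U+F293). The function names are hypothetical.

```python
ROW = 157  # trail bytes per lead byte: 0x40-0x7E (63) plus 0xA1-0xFE (94)

# (first lead byte, last lead byte, first trail byte, first PUA code point)
# Assumed range table; verify against bestfit950.txt before relying on it.
EUDC_RANGES = [
    (0x81, 0x8D, 0x40, 0xEEB8),
    (0x8E, 0xA0, 0x40, 0xE311),
    (0xC6, 0xC8, 0xA1, 0xF6B1),
    (0xFA, 0xFE, 0x40, 0xE000),
]


def trail_index(trail):
    """Position of a trail byte within its row, or None if it is invalid."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    return None


def cp950_eudc_to_pua(lead, trail):
    """Map a cp950 EUDC byte pair to its PUA code point, or None."""
    index = trail_index(trail)
    if index is None:
        return None
    for first, last, first_trail, base in EUDC_RANGES:
        if first <= lead <= last:
            offset = (lead - first) * ROW + index - trail_index(first_trail)
            return base + offset if offset >= 0 else None
    return None


print(hex(cp950_eudc_to_pua(0x87, 0x40)))  # 0xf266, as seen in the session
```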

--
components: Unicode
messages: 280811
nosy: Artoria2e5, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: No HKSCS support in Windows cp950
type: behavior
versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28693>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2016-10-09 Thread Mingye Wang

Mingye Wang added the comment:

The "join the web people" solution should look like this:

$ diff -Naurp a/_codecs_cn.c b/_codecs_cn.c
--- a/_codecs_cn.c  2016-10-09 14:24:04.675111500 -0700
+++ b/_codecs_cn.c  2016-10-09 14:27:06.600961500 -0700
@@ -128,6 +128,12 @@ ENCODER(gbk)
             continue;
         }

+        if (c == 0x20AC) { /* cp936, or web GBK */
+            WRITEBYTE1((unsigned char)0x80);
+            NEXT(1, 1);
+            continue;
+        }
+
         if (c > 0xFFFF)
             return 1;

@@ -159,6 +165,12 @@ DECODER(gbk)
             NEXT_IN(1);
             continue;
         }
+
+        if (c == 0x80) { /* cp936, or web GBK */
+            OUTCHAR(0x20AC);
+            NEXT_IN(1);
+            continue;
+        }

         REQUIRE_INBUF(2);

It should be mostly safe as this character is not previously defined in 
Python's GBK implementation.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28343>
___



[issue24036] GB2312 codec is using a wrong covert table

2016-10-02 Thread Mingye Wang

Mingye Wang added the comment:

> Advice for final user:

This seems worth adding to the codecs doc as a footnote. Perhaps 
something like "(deprecated) ... gb2312 is an obsolete encoding from the 1980s. 
Use gbk or gb18030 instead." will do.
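
A small illustration of why the advice matters: some characters in everyday 
use are outside GB2312 but inside its successors. The sample character and the 
helper name encodable are chosen only for this sketch.

```python
def encodable(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "镕"  # a character used in personal names, not part of GB2312-80
for name in ("gb2312", "gbk", "gb18030"):
    print(name, encodable(sample, name))
```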

> libiconv-1.14 is also using the wrong version.

Just a side note on the right/wrongfulness of libiconv: I have reported the 
GB18030 incompatibility as a libiconv bug.[1] From the replies, I learnt that 
1) what libiconv is using currently is a then-official mapping published on 
ftp.unicode.org; 2) vendor implementations of gb2312 differed historically. I 
have updated the corresponding section[2] on Wikipedia to include these old 
references.
  [1]: https://lists.gnu.org/archive/html/bug-gnu-libiconv/2016-09/msg4.html
  [2]: https://en.wikipedia.org/wiki/GB_2312#Two_implementations_of_GB2312

Still, being old and common does not necessarily mean being correct, as Ma Lin 
has demonstrated by showing the character semantics. To reflect this in a 
better-supported manner, I have added the GB2312-80 names for the glyphs in 
question to [2].

--
nosy: +Artoria2e5

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24036>
___



[issue28343] Bad encoding alias cp936 -> gbk: euro sign

2016-10-02 Thread Mingye Wang (Arthur2e5)

New submission from Mingye Wang (Arthur2e5):

Microsoft's cp936 defines a euro sign at 0x80, but Python would kick the bucket 
when asked to do something like `u'\u20ac'.encode('cp936')`. This may break 
things for zh-hans-cn Windows users who want to put a euro sign in their file 
names (if they insist on using a non-Unicode str for open() in py2, well.)

By looking at the codecs documentation, 'cp936' appears to be an alias for the 
GBK encoder, which by itself has been a very ambiguous name and subject to 
confusion --

The name "GBK" might refer to any of the four commonly-known members of the 
family of EUC-CN (gb2312) extensions that have full coverage of the Unicode 1.1 
CJK Unified Ideographs block:
  1) The original GBK. Rust-Encoding says that it's in a normative annex of 
GB13000.1-1993, but the closest thing I can find in my archive.org copy of that 
standard is an annex on an EUC (GB/T 2311) UCS.
  2) IANA GBK, or Microsoft cp936. This is the one with the euro sign I am 
looking for.
  3) GBK 1.0, a recommendation from the official standardization committees 
based on cp936. It's roughly cp936 without the euro sign but with 95 
additional PUA code points.
  4) W3C TR GBK. This GBK is basically gb18030-2005 without four-byte UTF, and 
with the euro sign. Roughly a union of 2) and 3) with some PUA code points 
moved into the right place.
 
Looking at Modules/cjkcodecs/_codecs_cn.c @ 104259:36b052adf5a7, Python seems 
to be doing either 1) or 3). For a quick fix you can just make an additional 
cp936 encoding around the gbk encoding that handles U+20AC; for some excitement 
(of potentially breaking stuff) you can join the web people and use either 2) 
or 4).
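
Until the codec itself changes, a caller can get the cp936 behavior for this 
one character with a codec error handler. Note that this is a different 
technique from the wrapper encoding suggested above: it is a per-call 
workaround, and the handler name "cp936euro" is made up for this sketch.

```python
import codecs

EURO = "\u20ac"


def cp936_euro(exc):
    """Map U+20AC <-> 0x80 when the gbk codec rejects either of them."""
    chunk = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeEncodeError) and chunk and set(chunk) == {EURO}:
        # Encode handlers may return bytes, which are emitted as-is.
        return b"\x80" * len(chunk), exc.end
    if isinstance(exc, UnicodeDecodeError) and chunk[:1] == b"\x80":
        # Consume only the offending 0x80 and resume right after it.
        return EURO, exc.start + 1
    raise exc


codecs.register_error("cp936euro", cp936_euro)

print("price: 5\u20ac".encode("gbk", errors="cp936euro"))
print(ascii(b"\x80".decode("gbk", errors="cp936euro")))
```

Because the handler only runs when the codec reports an error, a 0x80 that 
appears as the trail byte of a valid two-byte sequence is left alone.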

--
components: Unicode
messages: 277925
nosy: Mingye Wang (Arthur2e5), ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Bad encoding alias cp936 -> gbk: euro sign
type: behavior
versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28343>
___