[Python-Dev] Support of UTF-16 and UTF-32 source encodings
For now, UTF-16 and UTF-32 source encodings are not supported. There is a comment in Parser/tokenizer.c:

/* Disable support for UTF-16 BOMs until a decision is made whether this needs to be supported. */

Can we make a decision on whether this support will be added in the foreseeable future (say, in the next 10 years) or not?

Removing the commented-out and related code will help to refactor the tokenizer, and that can help to fix some existing bugs (e.g. issue14811, issue18961, issue20115, and maybe others). The current tokenizing code is too tangled.

If support for UTF-16 and UTF-32 is planned, I'll take this into account during refactoring. But in many places besides the tokenizer, an ASCII-compatible encoding of source files is expected.

___ Python-Dev mailing list [email protected] https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
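One way to see why code outside the tokenizer tends to assume an ASCII-compatible encoding is the coding declaration itself. The sketch below is only an illustration, not code from CPython: `COOKIE_RE` is a pattern modeled on the one documented in PEP 263, applied to raw bytes. It finds the declaration in UTF-8 bytes but not in UTF-16 bytes, where every ASCII letter is followed by a NUL byte.

```python
import re

# Coding-cookie pattern modeled on PEP 263, applied to raw bytes.
# (An assumption for illustration; not copied from the tokenizer.)
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

line = "# -*- coding: utf-16 -*-\n"

utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16-le")  # no BOM, two bytes per character

utf8_match = COOKIE_RE.match(utf8_bytes)
utf16_match = COOKIE_RE.match(utf16_bytes)

print(utf8_match.group(1))  # the declared encoding name, as bytes
print(utf16_match)          # None: NUL bytes separate the ASCII letters
print(utf16_bytes[:8])      # shows the interleaved NUL bytes
```

Any bytewise scan for the cookie, a shebang, or a newline would need a separate code path for such encodings.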
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
These encodings are rarely used. I don't think that any text editor uses them. Editors use ASCII, Latin-1, UTF-8 and... all locale encodings. But I don't know of any OS using UTF-16 as a locale encoding. UTF-32 wastes disk space.

OK, even if such a case exists, Python already accepts a very wide range of encodings. It is not worth making the parser much more complex just to support encodings which are also never used (for .py files).

Victor
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
I agree that supporting UTF-16 doesn't seem terribly useful. Also, thank you for giving the tokenizer some love!
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 15.11.15 00:56, Victor Stinner wrote:
> These encodings are rarely used. I don't think that any text editor use them. Editors use ascii, latin1, utf8 and... all locale encoding. But I don't know any OS using UTF-16 as a locale encoding. UTF-32 wastes disk space.

AFAIK the standard Windows editor, Notepad, uses UTF-16. And I have often encountered Windows resource files in UTF-16. UTF-16 was more popular than UTF-8 on Windows for some time. If this horse is dead, I'll throw it away.
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 11/14/2015 3:21 PM, Serhiy Storchaka wrote:
> On 15.11.15 00:56, Victor Stinner wrote:
>> These encodings are rarely used. I don't think that any text editor use them. Editors use ascii, latin1, utf8 and... all locale encoding. But I don't know any OS using UTF-16 as a locale encoding. UTF-32 wastes disk space.
>
> AFAIK the standard Windows editor Notepad uses UTF-16. And I often encountered Windows resource files in UTF-16. UTF-16 was more popular than UTF-8 on Windows some time. If this horse is dead I'll throw it away.

Just use UTF-8, ignoring an optional leading BOM. If someone wants to use something else, they can write a "preprocessor" to convert it to UTF-8 for use by Python.
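Such a "preprocessor" could be quite small. The following is a minimal sketch, assuming the input carries a BOM whenever it is not UTF-8; `to_utf8_source` and `_BOMS` are hypothetical names, not part of Python or of any proposal in this thread.

```python
import codecs

# BOMs to look for. UTF-32 must be tested before UTF-16 because the
# UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF8, "utf-8"),
]


def to_utf8_source(data: bytes) -> bytes:
    """Return data re-encoded as BOM-less UTF-8 (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the BOM, decode with the detected encoding, re-encode.
            return data[len(bom):].decode(encoding).encode("utf-8")
    # No BOM: assume the input is already UTF-8 and validate it.
    data.decode("utf-8")
    return data
```

One known limitation of this ordering: a UTF-16-LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32-LE, which seems unlikely for program source.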
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Victor Stinner writes:
> These encodings are rarely used. I don't think that any text editor use them.

MS Windows' Notepad can be made to use UTF-16.
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
The native encoding on Windows has been UTF-16 since Windows NT. Obviously we've survived without Python tokenization support for a long time, but every API uses it.

I've hit a few cases where it would have been handy for Python to be able to detect it, though nothing I couldn't work around. Saying it is rarely used is rather exposing your own unawareness though - it could arguably be the most commonly used encoding (depending on how you define "used").

Cheers,
Steve
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On Sun, Nov 15, 2015 at 12:06 PM, Steve Dower wrote:
> The native encoding on Windows has been UTF-16 since Windows NT. Obviously we've survived without Python tokenization support for a long time, but every API uses it.
>
> I've hit a few cases where it would have been handy for Python to be able to detect it, though nothing I couldn't work around. Saying it is rarely used is rather exposing your own unawareness though - it could arguably be the most commonly used encoding (depending on how you define "used").

What matters here is: How likely is it that an arbitrary Python script (or, say, "arbitrary text file") is encoded UTF-16 rather than something ASCII-compatible? I think even Notepad defaults to UTF-8 for files, now. The fact that it's sending text to the GUI subsystem in UTF-16 is immaterial here.

Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix program loaders won't.) That alone might be a reason for strongly encouraging ASCII-compat encodings.

ChrisA
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 11/14/2015 5:15 PM, Chris Angelico wrote:
> Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix program loaders won't.) That alone might be a reason for strongly encouraging ASCII-compat encodings.

That raises an interesting question about whether py.exe can handle a leading UTF-8 BOM. I have my emacs-on-Windows configured to store UTF-8 without a BOM, but Notepad would put a BOM when saving UTF-8, last I checked.
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Chris Angelico writes:
> Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix program loaders won't.)

A lot of them can't handle UTF-8 with a BOM, either.

> That alone might be a reason for strongly encouraging ASCII-compat encodings.

A "python" or "python3" or "env" executable in any particular location such as /usr/bin isn't technically guaranteed, either, it's just relied on as a "works 99% of the time" thing. There's a reasonable case to be made that transforming files in such a way as they get launched by python (which may require an encoding change to an ASCII-compatible encoding, or a wrapper script, or the python -x hack) is the responsibility of platform-specific installer code.
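The reason a BOM or a wide encoding defeats a simple program loader can be sketched in a few lines. This is only an illustration: `looks_like_shebang` is a hypothetical stand-in for a loader that checks nothing but the first two bytes of the file, which is an assumption about typical behavior rather than a description of any specific kernel.

```python
import codecs

script = "#!/usr/bin/env python3\nprint('hi')\n"


def looks_like_shebang(data: bytes) -> bool:
    """Mimic a loader that only checks the first two bytes (illustrative)."""
    return data[:2] == b"#!"


plain_utf8 = script.encode("utf-8")
utf8_with_bom = codecs.BOM_UTF8 + plain_utf8
utf16_no_bom = script.encode("utf-16-le")

print(looks_like_shebang(plain_utf8))     # the file starts with b"#!"
print(looks_like_shebang(utf8_with_bom))  # the file starts with EF BB BF
print(looks_like_shebang(utf16_no_bom))   # the file starts with b"#\x00"
```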
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 11/14/2015 5:15 PM, Chris Angelico wrote:
> I think even Notepad defaults to UTF-8 for files, now.

I just installed Windows 10 on a new machine and upgraded to the latest Windows 10 release, 1511. Notepad defaults to ANSI encoding, as I think it always has. UTF-8 is an option, and it does seem to try to notice the original encoding of the file when editing old files, but when creating a new one it uses ANSI.
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On Sun, Nov 15, 2015 at 12:27 PM, Glenn Linderman wrote:
> Notepad defaults to ANSI encoding, as I think it always has. UTF-8 is an option, and it does seem to try to notice the original encoding of the file, when editing old files, but when creating a new one ANSI.

Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?

ChrisA
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 11/14/2015 5:37 PM, Chris Angelico wrote:
> On Sun, Nov 15, 2015 at 12:27 PM, Glenn Linderman wrote:
>> Notepad defaults to ANSI encoding, as I think it always has. UTF-8 is an option, and it does seem to try to notice the original encoding of the file, when editing old files, but when creating a new one ANSI.
>
> Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?

I wouldn't trust an answer to this question that didn't come from someone that used Windows with Chinese, Japanese, or Korean, as their default language for the install. So I don't have a trustworthy answer to give.
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On Sun, Nov 15, 2015 at 12:47 PM, Glenn Linderman wrote:
> On 11/14/2015 5:37 PM, Chris Angelico wrote:
>> On Sun, Nov 15, 2015 at 12:27 PM, Glenn Linderman wrote:
>>> Notepad defaults to ANSI encoding, as I think it always has. UTF-8 is an option, and it does seem to try to notice the original encoding of the file, when editing old files, but when creating a new one ANSI.
>>
>> Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?
>
> I wouldn't trust an answer to this question that didn't come from someone that used Windows with Chinese, Japanese, or Korean, as their default language for the install. So I don't have a trustworthy answer to give.

Heh, yeah. But I'd trust an answer from Steve Dower :)

ChrisA
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Glenn Linderman writes:
> On 11/14/2015 5:37 PM, Chris Angelico wrote:
> > Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?
>
> I wouldn't trust an answer to this question that didn't come from someone that used Windows with Chinese, Japanese, or Korean, as their default language for the install. So I don't have a trustworthy answer to give.

AFAIK (I haven't actually used it as a default language, but I do know some details of their encodings), there are two main "issues" with the Windows CJK encodings regarding ASCII compatibility:

- There is a different symbol (a currency symbol) at 0x5c. Sort of. Most Unicode translations of it will treat it as a backslash, and users do expect it to work for things like \n, path separators, etc., but it displays as ¥ or ₩.
- Dual-byte characters can have ASCII bytes as their *second* byte.
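The second issue is easy to reproduce from Python. The sketch below uses `cp932`, Python's name for the Windows variant of Shift JIS, and the katakana letter ソ as an example character; the choice of character is an assumption made for illustration, since many other double-byte characters behave the same way.

```python
# cp932 is Python's name for the Windows variant of Shift JIS.
katakana_so = "\u30bd"                  # U+30BD KATAKANA LETTER SO
encoded = katakana_so.encode("cp932")

print(encoded.hex())                    # two bytes; the trail byte is 0x5c
print(b"\\" in encoded)                 # a bytewise search "finds" a backslash
print("\\" in encoded.decode("cp932"))  # a search on decoded text does not
```

A tokenizer that scans raw bytes for backslashes or quotes would therefore misread such text unless it decodes first.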
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On Sat, Nov 14, 2015 at 09:19:37PM +0200, Serhiy Storchaka wrote:
> If the support of UTF-16 and UTF-32 is planned, I'll take this to attention during refactoring. But in many places besides the tokenizer the ASCII compatible encoding of source files is expected.

Perhaps another way of looking at this: Is it feasible to drop support for arbitrary encodings and just require UTF-8 (with or without a pseudo-BOM)?

-- Steve
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On Sat, Nov 14, 2015 at 7:06 PM, Steve Dower wrote:
> The native encoding on Windows has been UTF-16 since Windows NT. Obviously we've survived without Python tokenization support for a long time, but every API uses it.

Windows 2000 was the first version to have broad support for UTF-16. Windows NT (1993) was released before UTF-16, so its Unicode support is limited to UCS-2. (Note that console windows still restrict each character cell to a single WCHAR character. So a non-BMP character encoded as a UTF-16 surrogate pair always appears as two box glyphs. Of course you can copy and paste from the console to a UTF-16 aware window just fine.)

> I've hit a few cases where it would have been handy for Python to be able to detect it, though nothing I couldn't work around.

Can you elaborate on some example cases? I can see using UTF-16 for the REPL in the Windows console, but a hypothetical WinConIO class could simply transcode to and from UTF-8. Drekin's win-unicode-console package works like this.
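For readers unfamiliar with surrogate pairs, the following sketch shows why a single non-BMP character occupies two 16-bit units (and hence two console cells). The code point U+1F600 is an arbitrary example chosen for illustration; the arithmetic is the standard UTF-16 encoding rule.

```python
import struct

ch = "\U0001F600"                    # a code point outside the BMP
utf16 = ch.encode("utf-16-le")       # little-endian, no BOM
units = struct.unpack("<2H", utf16)  # two 16-bit code units

# The same pair computed by hand from the UTF-16 encoding rule.
v = ord(ch) - 0x10000
high = 0xD800 + (v >> 10)            # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)           # low (trail) surrogate

print(len(ch), len(utf16) // 2)      # one character, two code units
print([hex(u) for u in units])
print(hex(high), hex(low))
```

UCS-2 has no such pairs: each 16-bit unit is a complete character, which is why it cannot represent code points above U+FFFF.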
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On Sat, Nov 14, 2015 at 7:15 PM, Chris Angelico wrote:
> Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix program loaders won't.) That alone might be a reason for strongly encouraging ASCII-compat encodings.

The launcher supports shebangs encoded as UTF-8 (default), UTF-16 (LE/BE), and UTF-32 (LE/BE):

https://hg.python.org/cpython/file/v3.5.0/PC/launcher.c#l1138
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Steve Dower writes:
> Saying [UTF-16] is rarely used is rather exposing your own unawareness though - it could arguably be the most commonly used encoding (depending on how you define "used").

Because we're discussing the storage of .py files, the relevant definition is the one used by the Unicode Standard, of course: a text/plain stream intended to be manipulated by any conformant Unicode processor that claims to handle text/plain. File formats with in-band formatting codes and allowing embedded non-text content like Word, or operating system or stdlib APIs, don't count. Nor have I seen UTF-16 used in email or HTML since the unregretted days of Win2k betas[1] (but I don't frequent Windows- or Java-oriented sites, so I have to admit my experience is limited in a possibly relevant way).

In Japan my impression is that modern versions of Windows have Memopad[sic] configured to emit UTF-8-with-signature by default for new files, and if not, the abomination known as Shift JIS (I'm not sure if that is a user or OEM option, though). Never a widechar encoding (after all, the whole point of Shift JIS was to use an 8-bit encoding for the katakana syllabary to save space or bandwidth).

I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python programs, they probably already know how to convert them to UTF-8. As somebody already suggested, this can be delegated to the py.exe launcher, if necessary, AFAICS.

I don't see any good reason for allowing non-ASCII-compatible encodings in the reference CPython interpreter.

However, having mentioned Windows and Java, I have to wonder about IronPython and Jython, respectively. Having never lived in either of those environments, I don't know what text encoding their users might prefer (or even occasionally encounter) in Python program source.

Steve

Footnotes:
[1] The version of Outlook Express shipped with them would emit "HTML" mail with ASCII tags and UTF-8-encoded text (even if it was encodable in pure ASCII). No, it wasn't spam, either, so it probably really was Outlook Express as it claimed to be in one of the headers.
