Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
I'm approaching this from the premise that we would like to avoid needless surprises for users not versed in text encoding. I did a simple experiment with notepad on Windows 7 as if a naïve user. If I write the one-line program:

    print("Hello world.") # by Jeff

It runs, no surprise. We may legitimately encounter Unicode in string literals and comments. If I write:

    print("j't'kif Anaïs!") # par Hervé

and try to save it, notepad tells me this file "contains characters in Unicode format which will be lost if you save this as an ANSI encoded text file." To keep the Unicode information I should cancel and choose a Unicode option.

In the "Save as" dialogue the default encoding is ANSI. The second option "Unicode" is clearly right as the warning said "Unicode" 3 times and I don't know what big-endian or UTF-8 mean. Good that worked. Closed and opened it looks exactly as I typed it. But the bytes I actually wrote on disk consist of a BOM and UTF-16-LE. And running it I get:

      File "bonjour.py", line 1
    SyntaxError: Non-UTF-8 code starting with '\xff' in file bonjour.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

If I take the hint here and save as UTF-8, then it works, including printing the accent. Inspection of the bytes shows it starts with a UTF-8 BOM.

In Jython I get the same results (choking on UTF-16), but saved as UTF-8, it works. I just have to make sure that's a Unicode constant if I want it to print correctly, as we're at 2.7. Jython has a checkered past with encodings, but tries to do exactly the same as CPython 2.7.x.

Now, a fact I haven't mentioned is that my machine was localised to simplified Chinese (to diagnose some bug) during this test. If I re-localise to my usual English (UK), I do not get the guidance from notepad: instead it quietly saves as Latin-1 (cp1252), perhaps because I'm in Western Europe. Python baulks at this, at the first accented character.
If I save from notepad as Unicode or UTF-8 the results are as before, including the BOM.

In some circumstances, then, the natural result of using notepad and not sticking to ASCII may be UTF-16-LE with a BOM, or Latin-1 depending on localisation, it seems. The Python error message provides a clue what a user should do, but they would need some background, a helpful teacher, or the Internet to sort it out.

Jeff Allen

On 15/11/2015 07:23, Stephen J. Turnbull wrote:

Steve Dower writes:

> Saying [UTF-16] is rarely used is rather exposing your own
> unawareness though - it could arguably be the most commonly used
> encoding (depending on how you define "used").

Because we're discussing the storage of .py files, the relevant definition is the one used by the Unicode Standard, of course: a text/plain stream intended to be manipulated by any conformant Unicode processor that claims to handle text/plain. File formats with in-band formatting codes and allowing embedded non-text content like Word, or operating system or stdlib APIs, don't count.

Nor have I seen UTF-16 used in email or HTML since the unregretted days of Win2k betas[1] (but I don't frequent Windows- or Java-oriented sites, so I have to admit my experience is limited in a possibly relevant way).

In Japan my impression is that modern versions of Windows have Memopad[sic] configured to emit UTF-8-with-signature by default for new files, and if not, the abomination known as Shift JIS (I'm not sure if that is a user or OEM option, though). Never a widechar encoding (after all, the whole point of Shift JIS was to use an 8-bit encoding for the katakana syllabary to save space or bandwidth).

I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python programs, they probably already know how to convert them to UTF-8. As somebody already suggested, this can be delegated to the py.exe launcher, if necessary, AFAICS.
I don't see any good reason for allowing non-ASCII-compatible encodings in the reference CPython interpreter.

However, having mentioned Windows and Java, I have to wonder about IronPython and Jython, respectively. Having never lived in either of those environments, I don't know what text encoding their users might prefer (or even occasionally encounter) in Python program source.

Steve

Footnotes:
[1] The version of Outlook Express shipped with them would emit "HTML" mail with ASCII tags and UTF-8-encoded text (even if it was encodable in pure ASCII). No, it wasn't spam, either, so it probably really was Outlook Express as it claimed to be in one of the headers.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/ja.py%40farowl.co.uk
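The byte-level view of the notepad experiment above can be reproduced in a few lines. This is a minimal sketch, not a record of what notepad actually does: the mapping from notepad's "Unicode", "UTF-8" and "ANSI" menu labels to the codecs below is an assumption based on the bytes described in the message, and `decodes_as_utf8` is a hypothetical helper name.

```python
import codecs

# The line from the experiment, as a str.
source = 'print("j\'t\'kif Anaïs!")  # par Hervé\n'

# Three on-disk forms. The label-to-codec mapping is an assumption of this
# sketch, based on the bytes described in the message.
saved = {
    "Unicode": codecs.BOM_UTF16_LE + source.encode("utf-16-le"),
    "UTF-8": source.encode("utf-8-sig"),  # utf-8-sig prepends the UTF-8 BOM
    "ANSI": source.encode("cp1252"),
}


def decodes_as_utf8(raw):
    """Return True if raw is valid UTF-8 (a leading UTF-8 BOM is accepted)."""
    try:
        raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return False
    return True


results = {label: decodes_as_utf8(raw) for label, raw in saved.items()}
for label, raw in saved.items():
    print(label, raw[:4], results[label])
```

Only the UTF-8 form decodes; the UTF-16 form fails on its very first byte, which is consistent with the '\xff' named in the SyntaxError quoted above.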
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
In a message of Sun, 15 Nov 2015 12:56:18 +, Paul Moore writes:
>On 15 November 2015 at 07:23, Stephen J. Turnbull wrote:
>> I don't see any good reason for allowing non-ASCII-compatible
>> encodings in the reference CPython interpreter.
>
>From PEP 263:
>
>    Any encoding which allows processing the first two lines in the
>    way indicated above is allowed as source code encoding, this
>    includes ASCII compatible encodings as well as certain
>    multi-byte encodings such as Shift_JIS. It does not include
>    encodings which use two or more bytes for all characters like
>    e.g. UTF-16. The reason for this is to keep the encoding
>    detection algorithm in the tokenizer simple.
>
>So this pretty much confirms that double-byte encodings are not valid
>for Python source files.
>
>Paul

Steve Turnbull, who lives in Japan, and speaks and writes Japanese is saying that "he cannot see any reason for allowing non-ASCII compatible encodings in CPython".

This makes me wonder.

Is this along the lines of 'even in Japan we do not want such things' or along the lines of 'when in Japan we want such things we want to so brutally do so much more, so keep the reference implementation simple, and don't try to help us with this seems-like-a-good-idea-but-isnt-in-practice' ideas like this one, or

Laura
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
"Stephen J. Turnbull" writes:
> I don't see any good reason for allowing non-ASCII-compatible
> encodings in the reference CPython interpreter.

There might be a case for having the tokenizer not care about encodings at all and just operate on a stream of unicode characters provided by a different layer.
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 15 November 2015 at 07:23, Stephen J. Turnbull wrote:
> I don't see any good reason for allowing non-ASCII-compatible
> encodings in the reference CPython interpreter.

From PEP 263:

    Any encoding which allows processing the first two lines in the
    way indicated above is allowed as source code encoding, this
    includes ASCII compatible encodings as well as certain
    multi-byte encodings such as Shift_JIS. It does not include
    encodings which use two or more bytes for all characters like
    e.g. UTF-16. The reason for this is to keep the encoding
    detection algorithm in the tokenizer simple.

So this pretty much confirms that double-byte encodings are not valid for Python source files.

Paul
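The PEP 263 rule quoted above can be observed from Python 3 through the standard library's `tokenize.detect_encoding`, which looks for a UTF-8 BOM or a coding cookie in the first two lines. The sketch below illustrates the behaviour of that library function only; it does not show CPython's C tokenizer, and `declared_encoding` is a hypothetical wrapper name.

```python
import codecs
import io
import tokenize


def declared_encoding(raw):
    """Return the encoding tokenize detects for raw source bytes, or None if it refuses."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    except SyntaxError:
        # Raised when the first lines are not UTF-8 and carry no usable cookie.
        return None
    return encoding


plain = b"print('hi')\n"
latin1_cookie = b"# -*- coding: latin-1 -*-\nprint('hi')\n"
sjis_cookie = b"# coding: shift_jis\nprint('hi')\n"
with_bom = codecs.BOM_UTF8 + plain
utf16 = "print('hi')\n".encode("utf-16")  # BOM plus two bytes per character

for label, raw in [("plain", plain), ("latin-1 cookie", latin1_cookie),
                   ("shift_jis cookie", sjis_cookie), ("UTF-8 BOM", with_bom),
                   ("UTF-16", utf16)]:
    print(label, declared_encoding(raw))
```

The UTF-16 input is rejected because the cookie search reads the first line as ASCII-compatible bytes, which is the limitation the PEP text describes.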
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 15 November 2015 at 16:40, Stephen J. Turnbull wrote:
> What PEP 263 did do was to specify that non-ASCII-compatible encodings
> are not supported by the PEP 263 mechanism for declaring the encoding
> of a Python source program. That's because it looks for a "magic
> number" which is the ASCII-encoded form of "coding:" in the first two
> lines. It doesn't rule out alternative mechanisms for encoding
> detection (specifically, use of the UTF-16 "BOM" signature); it just
> doesn't propose implementing them.

That was my initial thought. But combine this with the statement from the language docs that the default encoding when there is no PEP 263 encoding specification is UTF-8 (or ASCII in Python 2) and there's no valid way that I can see that a UTF-16 encoding could be valid (short of a formal language change).

Anyway, Guido has spoken, so I'll leave it there.

Paul
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
> On Nov 15, 2015, at 9:34 AM, Guido van Rossum wrote:
>
> Let me just unilaterally end this discussion. It's fine to disregard
> the future possibility of using UTF-16 or -32 for Python source code.
> Serhiy can happily rip out any comments or dead code dealing with that
> possibility.

Thank you.

Raymond
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Random832 writes:
> "Stephen J. Turnbull" writes:
> > I don't see any good reason for allowing non-ASCII-compatible
> > encodings in the reference CPython interpreter.
>
> There might be a case for having the tokenizer not care about encodings
> at all and just operate on a stream of unicode characters provided by a
> different layer.

That's exactly what the PEP 263 implementation does in Python 2 (with the caveat that Python 2 doesn't know anything about Unicode, it's a UTF-8 stream and the non-ASCII characters are treated as bytes of unknown semantics, so they can't be used in syntax).

I don't know about Python 3, I haven't looked at the decoding of source programs. But I would assume it implements PEP 263 still, except that since str is now either widechars or PEP 393 encoding (ie, flexible widechars) that encoding is now used instead of UTF-8.

I'm sure that there are plenty of ASCII-isms in the tokenizer in the sense that it assumes the ASCII *character* (not byte) repertoire. But I'm not sure why Serhiy thinks that the tokenizer cares about the representation on-disk. But as I say, I haven't looked at the code so he might be right.

Steve
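The idea of a tokenizer that only sees decoded characters can be illustrated with the standard library's `tokenize.generate_tokens`, which accepts a `readline` callable returning `str`. This is a sketch of the layering only: it uses the pure-Python `tokenize` module in Python 3, not Parser/tokenizer.c, and `token_strings` is a hypothetical helper name.

```python
import io
import tokenize


def token_strings(source_text):
    """Tokenize already-decoded text; no byte encoding is involved at this layer."""
    readline = io.StringIO(source_text).readline
    wanted = (tokenize.NAME, tokenize.OP, tokenize.STRING)
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in wanted]


raw = 'print("Anaïs")\n'.encode("utf-16")  # on-disk form the tokenizer never sees
text = raw.decode("utf-16")                # a separate layer does the decoding
print(token_strings(text))
```

Once decoding happens in a separate step, the tokenizing step produces the same tokens whatever the on-disk encoding was.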
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
On 14.11.2015 23:56, Victor Stinner wrote:
> These encodings are rarely used. I don't think that any text editor use
> them. Editors use ascii, latin1, utf8 and... all locale encoding. But I
> don't know any OS using UTF-16 as a locale encoding. UTF-32 wastes disk
> space.

UTF-16 is used a lot for Windows text files, e.g. Unicode CSV files (the save as "Unicode text file" option writes UTF-16).

However, nowadays, all text editors also support UTF-8 and many of these recognize the UTF-8 BOM as identifier to detect Unicode text files.

> Ok, even if it exists, Python already accepts a very wide range of
> encoding. It is not worth to make the parser much more complex just to
> support encodings which are also never used (for .py files).

Agreed. In Python 2 we decided to only allow ASCII super-sets for Python source files, which ruled out multi-byte encodings such as UTF-16 and -32. I don't think we need to make the parser more complex just to support them. UTF-8 works fine as Python source code encoding.

> Victor
> On 14 Nov 2015 20:20, "Serhiy Storchaka" wrote:
>
>> For now UTF-16 and UTF-32 source encodings are not supported. There is a
>> comment in Parser/tokenizer.c:
>>
>>     /* Disable support for UTF-16 BOMs until a decision
>>        is made whether this needs to be supported. */
>>
>> Can we make a decision whether this support will be added in foreseeable
>> future (say in near 10 years), or no?
>>
>> Removing commented out and related code will help to refactor the
>> tokenizer, and that can help to fix some existing bugs (e.g. issue14811,
>> issue18961, issue20115 and may be others). Current tokenizing code is too
>> tangled.
>>
>> If the support of UTF-16 and UTF-32 is planned, I'll take this to
>> attention during refactoring. But in many places besides the tokenizer the
>> ASCII compatible encoding of source files is expected.

-- 
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] Clarification of PEP 394 for scripts that run under Python 2 and 3
On 14 November 2015 at 08:57, Barry Warsaw wrote:
> On Nov 13, 2015, at 10:32 PM, Damien George wrote:
>
>>1. What is the true intent of PEP 394 when only Python 3 is installed? Is
>>"python" available or not to run scripts compatible with 2.x and 3.x?
>>
>>2. Is it possible to write a shebang line that supports all variations of
>>Python installations on *nix machines?
>>
>>3. If the answer to 2 is no, then what is the recommended way to support
>>all Python installations with one standalone script?
>
> It's important to remember that PEP 394 is an informational PEP, still under
> active (if dormant) discussion. It is also just a recommendation, and can't
> force any downstream redistributors to do one thing or another. Still, the
> intent is to provide a set of guidelines for the majority of *nix distributors
> to (eventually) adopt.
>
> I'll also point you to the recently created linux-sig where this topic is
> being discussed.
>
> https://mail.python.org/pipermail/linux-sig/2015-October/00.html
>
> As you've noticed, Arch took a particular approach, but speaking as part of
> the Debian/Ubuntu community, don't expect that ecosystem to go down the same
> path. It's highly unlikely /usr/bin/python will ever point to anything other
> than Python 2, even when only Python 3 is installed by default. That might
> change once Python 2.7 is actually EOL'd, and that won't happen for quite a
> long time. PEP 373 currently says there will be bug fix releases until 2020,
> and I'd expect security-only releases for some time after that.

This is the same situation for the Fedora/RHEL/CentOS ecosystem - for the time being, we think failure -> "yum install /usr/bin/python" is the cleanest user experience we can readily offer for Python 3 only systems.
However, we don't think it's an acceptable long term answer, and something *else* needs to be done with /usr/bin/python than just making it a symlink to Python 3 (since the latter will lead to cryptic error messages for Python 2 only scripts).

> In general, the discussions on linux-sig and elsewhere is coalescing around a
> launcher-type approach, where you'd put something like /usr/bin/py in your
> shebang if your script really is bilingual. Then `py` can try to figure out
> what's available on your system, what your preference is, etc. and then
> execute your script using that version-specific binary. I would expect the
> launcher to eventually be provided by the upstream Python development
> community.

A couple of other links folks may find worth reading for background:

* LWN summary of the discussion at this year's language summit: https://lwn.net/Articles/640296/
* Geoffrey Thomas's launcher concept: https://ldpreload.com/blog/usr-bin-python-23

This thread prompted me to do an update pass over https://wiki.python.org/moin/Python3LinuxDistroPortingStatus, adding these background links to the Python symlink section, and updating the Fedora porting status link to refer to the recently published porting DB: http://portingdb-encukou.rhcloud.com/

Regards,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
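The launcher-type approach quoted above ("figure out what's available on your system ... then execute your script using that version-specific binary") might look roughly like the sketch below. Everything here is hypothetical: the candidate names, their order, and the `pick_interpreter` function are illustrative assumptions, not the design of any existing `py` launcher. Only `shutil.which` is a real standard-library call.

```python
import shutil

# Hypothetical preference order. A real launcher would presumably consult
# user or distro configuration rather than a hard-coded tuple.
CANDIDATES = ("python3", "python2", "python")


def pick_interpreter(candidates=CANDIDATES, which=shutil.which):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


print(pick_interpreter())
```

A launcher built on this would then hand the script to the chosen binary, for example with `os.execv(path, [path, script])`. The `which` parameter exists only so the lookup can be replaced in tests.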
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Let me just unilaterally end this discussion. It's fine to disregard the future possibility of using UTF-16 or -32 for Python source code. Serhiy can happily rip out any comments or dead code dealing with that possibility.

-- 
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings
Laura Creighton writes:

> Steve Turnbull, who lives in Japan, and speaks and writes Japanese
> is saying that "he cannot see any reason for allowing non-ASCII
> compatible encodings in Cpython".
>
> This makes me wonder.
>
> Is this along the lines of 'even in Japan we do not want such
> things' or along the lines of 'when in Japan we want such things
> we want to so brutally do so much more, so keep the reference
> implementation simple, and don't try to help us with this
> seems-like-a-good-idea-but-isnt-in-practice' ideas like this one,
> or
>

I'm saying that to my knowledge Japan is the most complicated place there is when it comes to encodings, and even so, nobody here seems to be using UTF-16 as the encoding for program sources (or any other text/* media). Of course as Steve Dower pointed out it's in heavy use as an internal text encoding, in OS APIs, in some languages' stdlib APIs (ie, Java and I suppose .NET), and I guess in single-application file formats (Word), but the programs that use those APIs are written in ASCII-compatible encodings (and Shift JIS and Big5). The Japanese don't need or want UTF-16 in text files, etc.

Besides that, I can also say that PEP 263 didn't legislate the use of ASCII-compatible encodings. For one thing, Shift JIS and Big5 aren't 100% compatible because they use 0x20-0x7f in multibyte characters. They're just close enough to ASCII compatible to mostly "just work", at least on Microsoft OSes provided by OEMs in the relevant countries.

What PEP 263 did do was to specify that non-ASCII-compatible encodings are not supported by the PEP 263 mechanism for declaring the encoding of a Python source program. That's because it looks for a "magic number" which is the ASCII-encoded form of "coding:" in the first two lines. It doesn't rule out alternative mechanisms for encoding detection (specifically, use of the UTF-16 "BOM" signature); it just doesn't propose implementing them.
IIRC nobody has ever asked for them, but I think the idea is absurd so I have to admit I may have seen a request and forgot it instantly.

Bottom line: as long as Python (or the launcher) is able to transcode the source to the internal Unicode format (UTF-8 in Python 2, and widechar or PEP 393 in Python 3) before actually beginning parsing, any on-disk encoding is OK. But I just don't see a use case for UTF-16. If I'm wrong, I think that this feature should be added to launchers, not CPython, because it forces the decoder to know what formats other than ASCII are implemented and to try heuristics to guess, rather than just obeying the coding cookie.
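The "transcode before parsing" idea in the bottom line above can be sketched as a small launcher-side step in Python 3: pick a decoder from the BOM signature, decode to `str`, and only then hand the text to `compile`. This is an illustration under stated assumptions, not how CPython or py.exe behaves; `decode_source` and `run_source` are hypothetical names, and the fallback here ignores PEP 263 coding cookies for brevity.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_source(raw):
    """Decode source bytes by BOM signature, falling back to UTF-8."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode("utf-8")


def run_source(raw, filename="<transcoded>"):
    """Compile and execute the decoded text; return the resulting namespace."""
    namespace = {}
    exec(compile(decode_source(raw), filename, "exec"), namespace)
    return namespace


namespace = run_source('name = "Anaïs"\n'.encode("utf-16"))
print(namespace["name"])
```

Because `compile` receives a `str`, the parser never sees the on-disk bytes, which is the separation the message argues belongs in a launcher rather than in CPython.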