Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Jeff Allen
I'm approaching this from the premise that we would like to avoid 
needless surprises for users not versed in text encoding. I did a simple 
experiment with notepad on Windows 7 as if a naïve user. If I write the 
one-line program:


print("Hello world.") # by Jeff

It runs, no surprise.

We may legitimately encounter Unicode in string literals and comments. 
If I write:


print("j't'kif Anaïs!") # par Hervé

and try to save it, notepad tells me this file "contains characters in 
Unicode format which will be lost if you save this as an ANSI encoded 
text file." To keep the Unicode information I should cancel and choose a 
Unicode option. In the "Save as" dialogue the default encoding is ANSI. 
The second option "Unicode" is clearly right as the warning said 
"Unicode" 3 times and I don't know what big-endian or UTF-8 mean. Good 
that worked. Closed and reopened, it looks exactly as I typed it.


But the bytes I actually wrote on disk consist of a BOM and UTF-16-LE. 
And running it I get:

  File "bonjour.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file bonjour.py on 
line 1, but no encoding declared; see 
http://python.org/dev/peps/pep-0263/ for details


If I take the hint here and save as UTF-8, then it works, including 
printing the accent. Inspection of the bytes shows it starts with a 
UTF-8 BOM.


In Jython I get the same results (choking on UTF-16), but saved as 
UTF-8, it works. I just have to make sure that's a Unicode constant if I 
want it to print correctly, as we're at 2.7. Jython has a checkered past 
with encodings, but tries to do exactly the same as CPython 2.7.x.


Now, a fact I haven't mentioned is that my machine was localised to 
simplified Chinese (to diagnose some bug) during this test. If I 
re-localise to my usual English (UK), I do not get the guidance from 
notepad: instead it quietly saves as Latin-1 (cp1252), perhaps because 
I'm in Western Europe. Python baulks at this, at the first accented 
character. If I save from notepad as Unicode or UTF-8 the results are as 
before, including the BOM.


In some circumstances, then, the natural result of using notepad and not 
sticking to ASCII may be UTF-16-LE with a BOM, or Latin-1 depending on 
localisation, it seems. The Python error message provides a clue what a 
user should do, but they would need some background, a helpful teacher, 
or the Internet to sort it out.
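
For reference, a minimal sketch of what the three Notepad choices described above put on disk for the second program. It assumes "ANSI" means cp1252 (as on the English (UK) setup) and "Unicode" means a BOM followed by UTF-16-LE; the variable names are illustrative only, and the accented letters are written as escapes so the snippet itself stays ASCII.

```python
import codecs

# The second one-line program; \u00ef is i-diaeresis, \u00e9 is e-acute.
source = 'print("j\'t\'kif Ana\u00efs!") # par Herv\u00e9\n'

# "ANSI" on a Western European machine: cp1252, one byte per character, no BOM.
ansi = source.encode("cp1252")
# "Unicode" in the Save As dialogue: BOM followed by UTF-16-LE.
unicode_le = codecs.BOM_UTF16_LE + source.encode("utf-16-le")
# "UTF-8" in the Save As dialogue: UTF-8 preceded by its signature.
utf8_sig = source.encode("utf-8-sig")

for label, data in [("ANSI", ansi), ("Unicode", unicode_le), ("UTF-8", utf8_sig)]:
    print(f"{label:8} {data[:4]!r}")

# The cp1252 bytes are not valid UTF-8, which is why Python stops at the
# first accented character when no encoding is declared.
try:
    ansi.decode("utf-8")
    ansi_is_utf8 = True
except UnicodeDecodeError:
    ansi_is_utf8 = False
print("ANSI bytes decode as UTF-8:", ansi_is_utf8)
```

Only the UTF-8 variant starts with bytes that the default source decoding accepts, which matches the results reported above.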


Jeff Allen

On 15/11/2015 07:23, Stephen J. Turnbull wrote:

Steve Dower writes:

  > Saying [UTF-16] is rarely used is rather exposing your own
  > unawareness though - it could arguably be the most commonly used
  > encoding (depending on how you define "used").

Because we're discussing the storage of .py files, the relevant
definition is the one used by the Unicode Standard, of course: a
text/plain stream intended to be manipulated by any conformant Unicode
processor that claims to handle text/plain.  File formats with in-band
formatting codes and allowing embedded non-text content like Word, or
operating system or stdlib APIs, don't count.  Nor have I seen UTF-16
used in email or HTML since the unregretted days of Win2k betas[1]
(but I don't frequent Windows- or Java-oriented sites, so I have to
admit my experience is limited in a possibly relevant way).

In Japan my impression is that modern versions of Windows have
Memopad[sic] configured to emit UTF-8-with-signature by default for
new files, and if not, the abomination known as Shift JIS (I'm not
sure if that is a user or OEM option, though).  Never a widechar
encoding (after all, the whole point of Shift JIS was to use an 8-bit
encoding for the katakana syllabary to save space or bandwidth).

I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python
programs, they probably already know how to convert them to UTF-8.  As
somebody already suggested, this can be delegated to the py.exe
launcher, if necessary, AFAICS.

I don't see any good reason for allowing non-ASCII-compatible
encodings in the reference CPython interpreter.

However, having mentioned Windows and Java, I have to wonder about
IronPython and Jython, respectively.  Having never lived in either of
those environments, I don't know what text encoding their users might
prefer (or even occasionally encounter) in Python program source.

Steve

Footnotes:
[1]  The version of Outlook Express shipped with them would emit
"HTML" mail with ASCII tags and UTF-8-encoded text (even if it was
encodable in pure ASCII).  No, it wasn't spam, either, so it probably
really was Outlook Express as it claimed to be in one of the headers.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:https://mail.python.org/mailman/options/python-dev/ja.py%40farowl.co.uk




Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Laura Creighton
In a message of Sun, 15 Nov 2015 12:56:18 +, Paul Moore writes:
>On 15 November 2015 at 07:23, Stephen J. Turnbull  wrote:
>> I don't see any good reason for allowing non-ASCII-compatible
>> encodings in the reference CPython interpreter.
>
>From PEP 263:
>
>   Any encoding which allows processing the first two lines in the
>   way indicated above is allowed as source code encoding, this
>   includes ASCII compatible encodings as well as certain
>   multi-byte encodings such as Shift_JIS. It does not include
>   encodings which use two or more bytes for all characters like
>   e.g. UTF-16. The reason for this is to keep the encoding
>   detection algorithm in the tokenizer simple.
>
>So this pretty much confirms that double-byte encodings are not valid
>for Python source files.
>
>Paul

Steve Turnbull, who lives in Japan, and speaks and writes Japanese
is saying that "he cannot see any reason for allowing non-ASCII
compatible encodings in CPython".

This makes me wonder.

Is this along the lines of 'even in Japan we do not want such
things' or along the lines of 'when in Japan we want such things
we want to so brutally do so much more, so keep the reference
implementation simple, and don't try to help us with this 
seems-like-a-good-idea-but-isnt-in-practice' ideas like this one,
or


Laura


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Random832
"Stephen J. Turnbull"  writes:
> I don't see any good reason for allowing non-ASCII-compatible
> encodings in the reference CPython interpreter.

There might be a case for having the tokenizer not care about encodings
at all and just operate on a stream of unicode characters provided by a
different layer.
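
A rough sketch of that layering as it can already be done from Python 3 code, assuming the caller knows the on-disk encoding: the bytes are decoded first, and compile() only ever sees a str. The helper name run_source_bytes is hypothetical, and this says nothing about how the C tokenizer is organised internally.

```python
import contextlib
import io


def run_source_bytes(data, encoding, filename="<decoded>"):
    """Decode `data` outside the compiler, then compile and run the text.

    Returns whatever the program printed.
    """
    text = data.decode(encoding)             # the "different layer"
    code = compile(text, filename, "exec")   # compile() accepts str directly
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        exec(code, {"__name__": "__main__"})
    return captured.getvalue()


# A UTF-16 file with a BOM; the "utf-16" codec consumes the BOM while decoding.
utf16_program = 'print("Hello world.")  # par Herv\u00e9\n'.encode("utf-16")
print(run_source_bytes(utf16_program, "utf-16"), end="")
```

When compile() is handed a str there are no bytes left to interpret, so the on-disk encoding no longer matters at that point.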



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Paul Moore
On 15 November 2015 at 07:23, Stephen J. Turnbull  wrote:
> I don't see any good reason for allowing non-ASCII-compatible
> encodings in the reference CPython interpreter.

From PEP 263:

   Any encoding which allows processing the first two lines in the
   way indicated above is allowed as source code encoding, this
   includes ASCII compatible encodings as well as certain
   multi-byte encodings such as Shift_JIS. It does not include
   encodings which use two or more bytes for all characters like
   e.g. UTF-16. The reason for this is to keep the encoding
   detection algorithm in the tokenizer simple.

So this pretty much confirms that double-byte encodings are not valid
for Python source files.
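
As a hedged illustration, the standard library's tokenize.detect_encoding() (the Python-level counterpart of this detection in Python 3, not the C tokenizer itself) shows the behaviour the PEP describes. The wrapper name declared_encoding is made up for this sketch.

```python
import io
import tokenize


def declared_encoding(data):
    """Return the source encoding tokenize would use for these bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


# No cookie and no BOM: the default applies.
print(declared_encoding(b"print('hi')\n"))
# A PEP 263 cookie on the first line.
print(declared_encoding(b"# -*- coding: cp1252 -*-\nprint('hi')\n"))
# A UTF-8 signature.
print(declared_encoding(b"\xef\xbb\xbfprint('hi')\n"))

# UTF-16 bytes: the first line is neither ASCII nor UTF-8, so detection fails.
try:
    declared_encoding("print('hi')\n".encode("utf-16"))
except SyntaxError as exc:
    print("UTF-16 rejected:", exc)
```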

Paul


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Paul Moore
On 15 November 2015 at 16:40, Stephen J. Turnbull  wrote:
> What PEP 263 did do was to specify that non-ASCII-compatible encodings
> are not supported by the PEP 263 mechanism for declaring the encoding
> of a Python source program.  That's because it looks for a "magic
> number" which is the ASCII-encoded form of "coding:" in the first two
> lines.  It doesn't rule out alternative mechanisms for encoding
> detection (specifically, use of the UTF-16 "BOM" signature); it just
> doesn't propose implementing them.

That was my initial thought. But combine this with the statement from
the language docs that the default encoding when there is no PEP 263
encoding specification is UTF-8 (or ASCII in Python 2) and there's no
valid way that I can see that a UTF-16 encoding could be valid (short
of a formal language change).

Anyway, Guido has spoken, so I'll leave it there.

Paul


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Raymond Hettinger

> On Nov 15, 2015, at 9:34 AM, Guido van Rossum  wrote:
> 
> Let me just unilaterally end this discussion. It's fine to disregard
> the future possibility of using UTF-16 or -32 for Python source code.
> Serhiy can happily rip out any comments or dead code dealing with that
> possibility.

Thank you.


Raymond


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Stephen J. Turnbull
Random832 writes:
 > "Stephen J. Turnbull"  writes:
 > > I don't see any good reason for allowing non-ASCII-compatible
 > > encodings in the reference CPython interpreter.
 > 
 > There might be a case for having the tokenizer not care about encodings
 > at all and just operate on a stream of unicode characters provided by a
 > different layer.

That's exactly what the PEP 263 implementation does in Python 2 (with
the caveat that Python 2 doesn't know anything about Unicode, it's a
UTF-8 stream and the non-ASCII characters are treated as bytes of
unknown semantics, so they can't be used in syntax).  I don't know
about Python 3, I haven't looked at the decoding of source programs.
But I would assume it implements PEP 263 still, except that since str
is now either widechars or PEP 393 encoding (ie, flexible widechars)
that encoding is now used instead of UTF-8.

I'm sure that there are plenty of ASCII-isms in the tokenizer in the
sense that it assumes the ASCII *character* (not byte) repertoire.
But I'm not sure why Serhiy thinks that the tokenizer cares about the
representation on-disk.  But as I say, I haven't looked at the code so
he might be right.

Steve



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread M.-A. Lemburg
On 14.11.2015 23:56, Victor Stinner wrote:
> These encodings are rarely used. I don't think that any text editor use
> them. Editors use ascii, latin1, utf8 and... all locale encoding. But I
> don't know any OS using UTF-16 as a locale encoding. UTF-32 wastes disk
> space.

UTF-16 is used a lot for Windows text files, e.g. Unicode
CSV files (the save as "Unicode text file" option writes
UTF-16).

However, nowadays, all text editors also support UTF-8 and
many of these recognize the UTF-8 BOM as identifier to detect
Unicode text files.

> Ok, even if it exists, Python already accepts a very wide range of
> encoding. It is not worth to make the parser much more complex just to
> support encodings which are also never used (for .py files).

Agreed. In Python 2 we decided to only allow ASCII super-sets
for Python source files, which ruled out multi-byte encodings
such as UTF-16 and -32. I don't think we need to make the parser
more complex just to support them. UTF-8 works fine as Python
source code encoding.

> Victor
> Le 14 nov. 2015 20:20, "Serhiy Storchaka"  a écrit :
> 
>> For now UTF-16 and UTF-32 source encodings are not supported. There is a
>> comment in Parser/tokenizer.c:
>>
>> /* Disable support for UTF-16 BOMs until a decision
>>    is made whether this needs to be supported.  */
>>
>> Can we make a decision whether this support will be added in foreseeable
>> future (say in near 10 years), or no?
>>
>> Removing commented out and related code will help to refactor the
>> tokenizer, and that can help to fix some existing bugs (e.g. issue14811,
>> issue18961, issue20115 and may be others). Current tokenizing code is too
>> tangled.
>>
>> If the support of UTF-16 and UTF-32 is planned, I'll take this to
>> attention during refactoring. But in many places besides the tokenizer the
>> ASCII compatible encoding of source files is expected.
>>

-- 
Marc-Andre Lemburg
eGenix.com




Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Guido van Rossum
Let me just unilaterally end this discussion. It's fine to disregard
the future possibility of using UTF-16 or -32 for Python source code.
Serhiy can happily rip out any comments or dead code dealing with that
possibility.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-15 Thread Stephen J. Turnbull
Laura Creighton writes:

 > Steve Turnbull, who lives in Japan, and speaks and writes Japanese
 > is saying that "he cannot see any reason for allowing non-ASCII
 > compatible encodings in Cpython".
 > 
 > This makes me wonder.
 > 
 > Is this along the lines of 'even in Japan we do not want such
 > things' or along the lines of 'when in Japan we want such things
 > we want to so brutally do so much more, so keep the reference
 > implementation simple, and don't try to help us with this 
 > seems-like-a-good-idea-but-isnt-in-practice' ideas like this one,
 > or
 > 

I'm saying that to my knowledge Japan is the most complicated place
there is when it comes to encodings, and even so, nobody here seems to
be using UTF-16 as the encoding for program sources (or any other
text/* media).  Of course as Steve Dower pointed out it's in heavy use
as an internal text encoding, in OS APIs, in some languages' stdlib
APIs (ie, Java and I suppose .NET), and I guess in single-application
file formats (Word), but the programs that use those APIs are written
in ASCII-compatible encodings (and Shift JIS and Big5).  The Japanese
don't need or want UTF-16 in text files, etc.

Besides that, I can also say that PEP 263 didn't legislate the use of
ASCII-compatible encodings.  For one thing, Shift JIS and Big5 aren't
100% compatible because they use 0x20-0x7f in multibyte characters.
They're just close enough to ASCII compatible to mostly "just work",
at least on Microsoft OSes provided by OEMs in the relevant countries.

What PEP 263 did do was to specify that non-ASCII-compatible encodings
are not supported by the PEP 263 mechanism for declaring the encoding
of a Python source program.  That's because it looks for a "magic
number" which is the ASCII-encoded form of "coding:" in the first two
lines.  It doesn't rule out alternative mechanisms for encoding
detection (specifically, use of the UTF-16 "BOM" signature); it just
doesn't propose implementing them.

IIRC nobody has ever asked for them, but I think the idea is absurd
so I have to admit I may have seen a request and forgotten it instantly.
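
The "magic number" search described above can be sketched with the regular expression given in PEP 263, applied here to raw bytes. This is an illustration only; the real check lives in the C tokenizer, and the name COOKIE is an assumption of this sketch.

```python
import re

# The pattern from PEP 263, compiled as a bytes pattern.
COOKIE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

line = "# -*- coding: latin-1 -*-\n"

# The same declaration line stored in several on-disk encodings.
for codec in ("ascii", "utf-8", "shift_jis", "utf-16-le", "utf-32-le"):
    match = COOKIE.match(line.encode(codec))
    print(f"{codec:10} {'found' if match else 'not found'}")
```

In the UTF-16 and UTF-32 forms every ASCII letter is separated from the next by NUL bytes, so the byte sequence for "coding" never appears and the declaration cannot be seen.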

Bottom line: as long as Python (or the launcher) is able to transcode
the source to the internal Unicode format (UTF-8 in Python 2, and
widechar or PEP 393 in Python 3) before actually beginning parsing,
any on-disk encoding is OK.  But I just don't see a use case for
UTF-16.  If I'm wrong, I think that this feature should be added to
launchers, not CPython, because it forces the decoder to know what
formats other than ASCII are implemented and to try heuristics to
guess, rather than just obeying the coding cookie.



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Chris Angelico
On Sun, Nov 15, 2015 at 12:47 PM, Glenn Linderman  wrote:
> On 11/14/2015 5:37 PM, Chris Angelico wrote:
>
> On Sun, Nov 15, 2015 at 12:27 PM, Glenn Linderman 
> wrote:
>
> Notepad defaults to ANSI encoding, as I think it always has.  UTF-8 is an
> option, and it does seem to try to notice the original encoding of the file,
> when editing old files, but when creating a new one ANSI.
>
> Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?
>
>
> I wouldn't trust an answer to this question that didn't come from someone
> that used Windows with Chinese, Japanese, or Korean, as their default
> language for the install. So I don't have a trustworthy answer to give.
>

Heh, yeah. But I'd trust an answer from Steve Dower :)

ChrisA


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread eryksun
On Sat, Nov 14, 2015 at 7:06 PM, Steve Dower  wrote:
> The native encoding on Windows has been UTF-16 since Windows NT. Obviously
> we've survived without Python tokenization support for a long time, but
> every API uses it.

Windows 2000 was the first version to have broad support for UTF-16.
Windows NT (1993) was released before UTF-16, so its Unicode support
is limited to UCS-2.

(Note that console windows still restrict each character cell to a
single WCHAR character. So a non-BMP character encoded as a UTF-16
surrogate pair always appears as two box glyphs. Of course you can
copy and paste from the console to a UTF-16 aware window just fine.)
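
A small sketch of the surrogate-pair point: a character outside the BMP takes two 16-bit code units in UTF-16, which is exactly what a UCS-2-only component cannot treat as one character. The helper name utf16_code_units is hypothetical.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of `text` in UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


bmp = "\u00ef"           # LATIN SMALL LETTER I WITH DIAERESIS, inside the BMP
non_bmp = "\U0001F600"   # a code point outside the BMP

print([hex(u) for u in utf16_code_units(bmp)])      # one code unit
print([hex(u) for u in utf16_code_units(non_bmp)])  # high and low surrogate
```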

> I've hit a few cases where it would have been handy for Python to be able to
> detect it, though nothing I couldn't work around.

Can you elaborate on some example cases? I can see using UTF-16 for the
REPL in the Windows console, but a hypothetical WinConIO class could
simply transcode to and from UTF-8. Drekin's win-unicode-console
package works like this.


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread eryksun
On Sat, Nov 14, 2015 at 7:15 PM, Chris Angelico  wrote:
> Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix
> program loaders won't.) That alone might be a reason for strongly
> encouraging ASCII-compat encodings.

The launcher supports shebangs encoded as UTF-8 (default), UTF-16
(LE/BE), and UTF-32 (LE/BE):

https://hg.python.org/cpython/file/v3.5.0/PC/launcher.c#l1138
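
For illustration only (this is not the launcher's code), a sketch of recognising those five signatures in Python with the constants from the codecs module. The names _BOMS, sniff_bom and first_line are made up here. Note the ordering: the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM, so a UTF-16-LE file whose first character is U+0000 would be misread by this sketch.

```python
import codecs

# Longest signatures first: checking UTF-16 before UTF-32 would misidentify
# a UTF-32-LE file, whose BOM starts with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, default="utf-8"):
    """Return (codec name, BOM length) for the start of `data`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def first_line(data):
    """Decode just the first line, e.g. to read a shebang."""
    codec, skip = sniff_bom(data)
    lines = data[skip:].decode(codec).splitlines()
    return lines[0] if lines else ""


script = codecs.BOM_UTF16_LE + "#!python3\nprint('hi')\n".encode("utf-16-le")
print(first_line(script))
```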


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Stephen J. Turnbull
Steve Dower writes:

 > Saying [UTF-16] is rarely used is rather exposing your own
 > unawareness though - it could arguably be the most commonly used
 > encoding (depending on how you define "used").

Because we're discussing the storage of .py files, the relevant
definition is the one used by the Unicode Standard, of course: a
text/plain stream intended to be manipulated by any conformant Unicode
processor that claims to handle text/plain.  File formats with in-band
formatting codes and allowing embedded non-text content like Word, or
operating system or stdlib APIs, don't count.  Nor have I seen UTF-16
used in email or HTML since the unregretted days of Win2k betas[1]
(but I don't frequent Windows- or Java-oriented sites, so I have to
admit my experience is limited in a possibly relevant way).

In Japan my impression is that modern versions of Windows have
Memopad[sic] configured to emit UTF-8-with-signature by default for
new files, and if not, the abomination known as Shift JIS (I'm not
sure if that is a user or OEM option, though).  Never a widechar
encoding (after all, the whole point of Shift JIS was to use an 8-bit
encoding for the katakana syllabary to save space or bandwidth).

I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python
programs, they probably already know how to convert them to UTF-8.  As
somebody already suggested, this can be delegated to the py.exe
launcher, if necessary, AFAICS.

I don't see any good reason for allowing non-ASCII-compatible
encodings in the reference CPython interpreter.

However, having mentioned Windows and Java, I have to wonder about
IronPython and Jython, respectively.  Having never lived in either of
those environments, I don't know what text encoding their users might
prefer (or even occasionally encounter) in Python program source.

Steve

Footnotes: 
[1]  The version of Outlook Express shipped with them would emit
"HTML" mail with ASCII tags and UTF-8-encoded text (even if it was
encodable in pure ASCII).  No, it wasn't spam, either, so it probably
really was Outlook Express as it claimed to be in one of the headers.



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Random832
Glenn Linderman  writes:
> On 11/14/2015 5:37 PM, Chris Angelico wrote:
> > Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?
>
> I wouldn't trust an answer to this question that didn't come from
> someone that used Windows with Chinese, Japanese, or Korean, as their
> default language for the install. So I don't have a trustworthy answer
> to give.

AFAIK (I haven't actually used it as a default language, but I do know
some details of their encodings), there are two main "issues" with the
Windows CJK encodings regarding ASCII compatibility:

- There is a different symbol (a currency symbol) at 0x5c. Sort of. Most
  unicode translations of it will treat it as a backslash, and users do
  expect it to work for things like \n, path separators, etc, but it
  displays as ¥ or ₩.

- Dual-byte characters can have ASCII bytes as their *second* byte.
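
A short sketch of the second point using Python's shift_jis codec: the trail byte of KATAKANA LETTER SO is 0x5c, the same byte value as an ASCII backslash. The character is written as an escape so the snippet itself stays ASCII.

```python
so = "\u30bd"                     # KATAKANA LETTER SO
encoded = so.encode("shift_jis")

print(encoded)            # the two bytes of the character
print(hex(encoded[1]))    # the trail byte on its own
print(b"\\" in encoded)   # a byte-level search "finds" a backslash
```

Any tool that scans such a file byte by byte for backslashes or quotes, without decoding first, can be misled by characters like this one.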



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Steven D'Aprano
On Sat, Nov 14, 2015 at 09:19:37PM +0200, Serhiy Storchaka wrote:

> If the support of UTF-16 and UTF-32 is planned, I'll take this to 
> attention during refactoring. But in many places besides the tokenizer 
> the ASCII compatible encoding of source files is expected.

Perhaps another way of looking at this:

Is it feasible to drop support for arbitrary encodings and just require 
UTF-8 (with or without a pseudo-BOM)?



-- 
Steve


[Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Serhiy Storchaka
For now UTF-16 and UTF-32 source encodings are not supported. There is a 
comment in Parser/tokenizer.c:


/* Disable support for UTF-16 BOMs until a decision
   is made whether this needs to be supported.  */

Can we decide whether this support will be added in the foreseeable 
future (say, in the next 10 years), or not?


Removing the commented-out and related code will help to refactor the 
tokenizer, and that can help to fix some existing bugs (e.g. issue14811, 
issue18961, issue20115 and maybe others). The current tokenizing code is 
too tangled.


If support for UTF-16 and UTF-32 is planned, I'll take this into 
account during refactoring. But in many places besides the tokenizer, 
an ASCII-compatible encoding of source files is expected.




Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Victor Stinner
These encodings are rarely used. I don't think that any text editor uses
them. Editors use ASCII, Latin-1, UTF-8 and... all locale encodings. But I
don't know of any OS using UTF-16 as a locale encoding. UTF-32 wastes disk
space.

OK, even if it exists, Python already accepts a very wide range of
encodings. It is not worth making the parser much more complex just to
support encodings which are never used (for .py files).

Victor
Le 14 nov. 2015 20:20, "Serhiy Storchaka"  a écrit :

> For now UTF-16 and UTF-32 source encodings are not supported. There is a
> comment in Parser/tokenizer.c:
>
> /* Disable support for UTF-16 BOMs until a decision
>    is made whether this needs to be supported.  */
>
> Can we make a decision whether this support will be added in foreseeable
> future (say in near 10 years), or no?
>
> Removing commented out and related code will help to refactor the
> tokenizer, and that can help to fix some existing bugs (e.g. issue14811,
> issue18961, issue20115 and may be others). Current tokenizing code is too
> tangled.
>
> If the support of UTF-16 and UTF-32 is planned, I'll take this to
> attention during refactoring. But in many places besides the tokenizer the
> ASCII compatible encoding of source files is expected.
>


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Benjamin Peterson
I agree that supporting UTF-16 doesn't seem terribly useful. Also, thank
you for giving the tokenizer some love!

On Sat, Nov 14, 2015, at 11:19, Serhiy Storchaka wrote:
> For now UTF-16 and UTF-32 source encodings are not supported. There is a 
> comment in Parser/tokenizer.c:
> 
> /* Disable support for UTF-16 BOMs until a decision
>    is made whether this needs to be supported.  */
> 
> Can we make a decision whether this support will be added in foreseeable 
> future (say in near 10 years), or no?
> 
> Removing commented out and related code will help to refactor the 
> tokenizer, and that can help to fix some existing bugs (e.g. issue14811, 
> issue18961, issue20115 and may be others). Current tokenizing code is 
> too tangled.
> 
> If the support of UTF-16 and UTF-32 is planned, I'll take this to 
> attention during refactoring. But in many places besides the tokenizer 
> the ASCII compatible encoding of source files is expected.
> 


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Serhiy Storchaka

On 15.11.15 00:56, Victor Stinner wrote:

These encodings are rarely used. I don't think that any text editor use
them. Editors use ascii, latin1, utf8 and... all locale encoding. But I
don't know any OS using UTF-16 as a locale encoding. UTF-32 wastes disk
space.


AFAIK the standard Windows editor Notepad uses UTF-16. And I have often 
encountered Windows resource files in UTF-16. UTF-16 was more popular 
than UTF-8 on Windows for some time. If this horse is dead I'll throw it away.





Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Glenn Linderman

On 11/14/2015 3:21 PM, Serhiy Storchaka wrote:

On 15.11.15 00:56, Victor Stinner wrote:

These encodings are rarely used. I don't think that any text editor use
them. Editors use ascii, latin1, utf8 and... all locale encoding. But I
don't know any OS using UTF-16 as a locale encoding. UTF-32 wastes disk
space.


AFAIK the standard Windows editor Notepad uses UTF-16. And I often 
encountered Windows resource files in UTF-16. UTF-16 was more popular 
than UTF-8 on Windows some time. If this horse is dead I'll throw it away.


Just use UTF-8, ignoring an optional leading BOM. If someone wants to 
use something else, they can write a "preprocessor" to convert it to 
UTF-8 for use by Python.
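
A minimal sketch of such a "preprocessor", assuming the caller already knows (or has guessed) the source encoding. The function name to_utf8_source is hypothetical, and no encoding detection is attempted here.

```python
def to_utf8_source(data, source_encoding):
    """Transcode source bytes to BOM-less UTF-8.

    `source_encoding` must be supplied by the caller; "utf-16" and
    "utf-8-sig" both consume a leading BOM while decoding.
    """
    text = data.decode(source_encoding)
    # Drop a stray U+FEFF that endian-specific codecs such as "utf-16-le"
    # leave in place when the file starts with a BOM.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


program = 'print("Hello world.")  # par Herv\u00e9\n'
for codec in ("utf-8-sig", "utf-16", "utf-16-le"):
    same = to_utf8_source(program.encode(codec), codec) == program.encode("utf-8")
    print(codec, same)
```

The "utf-8-sig" codec also accepts input without a signature, which covers the "optional leading BOM" case for files that are already UTF-8.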


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Random832
Victor Stinner  writes:
> These encodings are rarely used. I don't think that any text editor
> use them.

MS Windows' Notepad can be made to use UTF-16.



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Steve Dower
The native encoding on Windows has been UTF-16 since Windows NT. Obviously 
we've survived without Python tokenization support for a long time, but every 
API uses it.

I've hit a few cases where it would have been handy for Python to be able to 
detect it, though nothing I couldn't work around. Saying it is rarely used is 
rather exposing your own unawareness though - it could arguably be the most 
commonly used encoding (depending on how you define "used").

Cheers,
Steve

Top-posted from my Windows Phone

-----Original Message-----
From: "Victor Stinner" <victor.stin...@gmail.com>
Sent: 11/14/2015 14:58
To: "Serhiy Storchaka" <storch...@gmail.com>
Cc: "python-dev@python.org" <python-dev@python.org>
Subject: Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

These encodings are rarely used. I don't think that any text editor uses them. 
Editors use ASCII, Latin-1, UTF-8 and... all locale encodings. But I don't know 
of any OS using UTF-16 as a locale encoding. UTF-32 wastes disk space.
OK, even if such an OS exists, Python already accepts a very wide range of 
encodings. It is not worth making the parser much more complex just to support 
encodings that are never used anyway (for .py files).
Victor

On 14 Nov 2015 at 20:20, "Serhiy Storchaka" <storch...@gmail.com> wrote:

For now UTF-16 and UTF-32 source encodings are not supported. There is a 
comment in Parser/tokenizer.c:

/* Disable support for UTF-16 BOMs until a decision
   is made whether this needs to be supported.  */

Can we decide whether this support will be added in the foreseeable future 
(say, within the next 10 years), or not?

Removing the commented-out and related code would help in refactoring the 
tokenizer, which in turn could help fix some existing bugs (e.g. issue14811, 
issue18961, issue20115 and maybe others). The current tokenizing code is too 
tangled.

If support for UTF-16 and UTF-32 is planned, I'll take this into account 
during refactoring. But many places besides the tokenizer expect source files 
to be in an ASCII-compatible encoding.
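
The current behaviour can be observed from Python itself. The sketch 
below uses the pure-Python tokenize module, not the C code in 
Parser/tokenizer.c, so it only approximates what the interpreter does; 
the helper name probe is made up for illustration.

```python
import codecs
import io
import tokenize


def probe(raw: bytes) -> str:
    """Report what tokenize.detect_encoding() makes of raw source bytes."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(raw).readline)
    except SyntaxError as exc:
        # Raised when the first line is neither valid UTF-8 nor a
        # usable coding declaration.
        return "rejected: {}".format(exc.msg)
    return encoding


text = 'print("Hello world.")\n'

# A UTF-8 BOM is recognised and reported as the "utf-8-sig" codec.
utf8_bom = probe(codecs.BOM_UTF8 + text.encode("utf-8"))

# A UTF-16 BOM is not recognised; the first line fails to decode.
utf16_bom = probe(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

print(utf8_bom)
print(utf16_bom)
```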



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Chris Angelico
On Sun, Nov 15, 2015 at 12:06 PM, Steve Dower  wrote:
> The native encoding on Windows has been UTF-16 since Windows NT. Obviously
> we've survived without Python tokenization support for a long time, but
> every API uses it.
>
> I've hit a few cases where it would have been handy for Python to be able to
> detect it, though nothing I couldn't work around. Saying it is rarely used
> is rather exposing your own unawareness though - it could arguably be the
> most commonly used encoding (depending on how you define "used").

What matters here is: How likely is it that an arbitrary Python script
(or, say, "arbitrary text file") is encoded UTF-16 rather than
something ASCII-compatible? I think even Notepad defaults to UTF-8 for
files, now. The fact that it's sending text to the GUI subsystem in
UTF-16 is immaterial here.

Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix
program loaders won't.) That alone might be a reason for strongly
encouraging ASCII-compat encodings.

ChrisA


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Glenn Linderman

On 11/14/2015 5:15 PM, Chris Angelico wrote:

Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix
program loaders won't.) That alone might be a reason for strongly
encouraging ASCII-compat encodings.


That raises an interesting question about whether py.exe can handle a 
leading UTF-8 BOM.  I have my emacs-on-Windows configured to store UTF-8 
without a BOM, but Notepad would put a BOM when saving UTF-8, last I checked.


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Random832
Chris Angelico  writes:
> Can the py.exe launcher handle a UTF-16 shebang? (I'm pretty sure Unix
> program loaders won't.)

A lot of them can't handle UTF-8 with a BOM, either.

> That alone might be a reason for strongly encouraging ASCII-compat
> encodings.

A "python" or "python3" or "env" executable in any particular location
such as /usr/bin isn't technically guaranteed, either; it's just relied
on as a "works 99% of the time" thing. There's a reasonable case to be
made that transforming files in such a way that they get launched by
python (which may require an encoding change to an ASCII-compatible
encoding, or a wrapper script, or the python -x hack) is the
responsibility of platform-specific installer code.
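
The shebang concern in this subthread can be seen at the byte level. 
This sketch assumes a loader that looks for the two bytes "#!" at 
offset 0 of the file, as Unix kernels commonly do; it says nothing 
about what py.exe actually does. The shebang line and the variant 
names are illustrative only.

```python
import codecs

shebang = "#!/usr/bin/env python3\n"

# The same first line as it would appear on disk in three encodings.
variants = {
    "utf-8": shebang.encode("utf-8"),
    "utf-8 with BOM": codecs.BOM_UTF8 + shebang.encode("utf-8"),
    "utf-16-le with BOM": codecs.BOM_UTF16_LE + shebang.encode("utf-16-le"),
}

# A loader comparing the first two bytes against b"#!" only matches
# the BOM-less, ASCII-compatible file.
starts_with_shebang = {
    name: data.startswith(b"#!") for name, data in variants.items()
}

for name, ok in starts_with_shebang.items():
    # Show the leading bytes so the BOM (if any) is visible.
    print(name, ok, variants[name][:6])
```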



Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Glenn Linderman

On 11/14/2015 5:15 PM, Chris Angelico wrote:

I think even Notepad defaults to UTF-8 for
files, now.


Just installed Windows 10 on a new machine, and upgraded to the latest 
Windows 10 release, 1511.


Notepad defaults to ANSI encoding, as I think it always has.  UTF-8 is 
an option, and it does seem to try to notice the original encoding of 
the file when editing old files, but when creating a new one it uses ANSI.


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Chris Angelico
On Sun, Nov 15, 2015 at 12:27 PM, Glenn Linderman  wrote:
> Notepad defaults to ANSI encoding, as I think it always has.  UTF-8 is an
> option, and it does seem to try to notice the original encoding of the file
> when editing old files, but when creating a new one it uses ANSI.

Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?

ChrisA


Re: [Python-Dev] Support of UTF-16 and UTF-32 source encodings

2015-11-14 Thread Glenn Linderman

On 11/14/2015 5:37 PM, Chris Angelico wrote:

On Sun, Nov 15, 2015 at 12:27 PM, Glenn Linderman  wrote:

Notepad defaults to ANSI encoding, as I think it always has.  UTF-8 is an
option, and it does seem to try to notice the original encoding of the file
when editing old files, but when creating a new one it uses ANSI.

Thanks. Is "ANSI" always an eight-bit ASCII-compatible encoding?


I wouldn't trust an answer to this question that didn't come from 
someone who uses Windows with Chinese, Japanese, or Korean as the 
default language for the install. So I don't have a trustworthy answer 
to give.
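
Short of asking such a user, the question can at least be probed 
against Python's own codec tables. This does not show what Notepad or 
Windows actually writes on a given install; the list of code pages and 
the sample characters are assumptions chosen for illustration.

```python
# Windows "ANSI" code pages for which Python ships a codec: Western
# European, Japanese, Simplified Chinese, Korean, Traditional Chinese.
CODE_PAGES = ["cp1252", "cp932", "cp936", "cp949", "cp950"]

ascii_bytes = bytes(range(128))
ascii_text = ascii_bytes.decode("ascii")

# True where the 128 ASCII bytes decode to the same 128 characters,
# i.e. where pure-ASCII text reads the same in that code page.
ascii_superset = {
    cp: ascii_bytes.decode(cp) == ascii_text for cp in CODE_PAGES
}

# The East Asian code pages are not eight-bit: one CJK character
# takes more than one byte.
sample_widths = {
    "cp932": len("表".encode("cp932")),
    "cp936": len("中".encode("cp936")),
    "cp949": len("한".encode("cp949")),
    "cp950": len("中".encode("cp950")),
}

# A byte in the ASCII range can also be the second byte of a
# multi-byte character: here the trail byte equals b"\\".
trail = "表".encode("cp932")

print(ascii_superset)
print(sample_widths)
print(trail)
```

So a scan for raw ASCII bytes in such a file can hit the inside of a 
multi-byte character, even when plain ASCII text decodes as expected.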