Re: [Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8

2016-09-03 Thread Adam Bartoš
Nick Coghlan (ncoghlan at gmail.com) on Sat Sep 3 12:27:44 EDT 2016 wrote:

> After also reading the Windows console encoding PEP, I realised
> there's a couple of missing discussions here regarding the impacts on
> sys.argv, os.environ, and os.environb.
>
> The reason that's relevant is that "sys.getfilesystemencoding" is a
> bit of a misnomer, as it's also used to determine the assumed encoding
> of command line arguments and environment variables.
>
>
Regarding sys.argv, AFAIK Unicode arguments work well on Python 3. Even
non-BMP characters are transferred correctly.
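
One way to check this claim on a given system is to pass a non-BMP character to a child interpreter and compare what arrives in sys.argv. This is only a sketch: the helper name roundtrip_argv is made up for illustration, and the outcome depends on the platform and locale settings in effect.

```python
import ast
import subprocess
import sys

def roundtrip_argv(arg):
    """Pass arg to a child Python process and return what it sees in sys.argv[1]."""
    # The child prints an ASCII-only repr, so the encoding of the pipe between
    # the two processes cannot distort the result.
    child_code = "import sys; print(ascii(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", child_code, arg],
        capture_output=True, text=True, check=True,
    )
    return ast.literal_eval(result.stdout.strip())

if __name__ == "__main__":
    sample = "α \U0001F600"  # includes one character outside the BMP
    print(roundtrip_argv(sample) == sample)
```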


Adam Bartoš
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 528: Change Windows console encoding to UTF-8

2016-09-03 Thread Adam Bartoš
>
> The use of an ASCII compatible encoding is required to maintain
> compatibility with code that bypasses the TextIOWrapper and directly
> writes ASCII bytes to the standard streams (for example, 
> [process_stdinreader.py]
> <https://www.python.org/dev/peps/pep-0528/#process-stdinreader-py> ).
> Code that assumes a particular encoding for the standard streams other than
> ASCII will likely break.


Note that for example in IDLE there are sys.std* stream objects that don't
have buffer attribute. I would argue that it is incorrect to suppose that
there is always one.
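
A minimal sketch of code that does not assume a buffer attribute: it writes through .buffer when the stream has one and falls back to the text layer otherwise. The helper name write_ascii is hypothetical, and io.StringIO stands in here for an IDLE-like stream without a buffer.

```python
import io

def write_ascii(stream, data):
    """Write ASCII bytes to a text stream, using .buffer only when it exists."""
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        stream.flush()            # keep ordering with earlier text writes
        buffer.write(data)
        buffer.flush()
    else:
        # No binary layer available: decode and go through the text API.
        stream.write(data.decode("ascii"))

# A stream without .buffer (similar to what IDLE installs).
text_only = io.StringIO()
write_ascii(text_only, b"hello")
assert text_only.getvalue() == "hello"

# A stream with .buffer (similar to the regular sys.stdout).
raw = io.BytesIO()
wrapped = io.TextIOWrapper(raw, encoding="utf-8")
write_ascii(wrapped, b"hello")
assert raw.getvalue() == b"hello"
```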

Adam Bartoš


Re: [Python-Dev] PEP 528: Change Windows console encoding to UTF-8

2016-09-03 Thread Adam Bartoš
Steve Dower (steve.dower at python.org) on Thu Sep 1 18:28:53 EDT 2016 wrote:

> I'm about to be offline for a few days, so I wanted to get my current
> draft PEPs out so people can read and review.
>
> I don't believe there is a lot of change as a result of either PEP, but
> the impact of what change there is needs to be weighed against the benefits.
>
> If anything, I'm likely to have underplayed the impact of this change
> (though I've had a *lot* of support for this one). Just stating my
> biases up-front - take it as you wish.
>
> See https://bugs.python.org/issue1602 for the current proposed patch for
> this PEP. I will likely update it after my upcoming flights, but it's in
> pretty good shape right now.
>
> Cheers,
> Steve
>
>
Did you consider that the hard-wired readline hook
`_PyOS_WindowsConsoleReadline` won't be needed in the future if
http://bugs.python.org/issue17620 gets resolved, so that the default hook on
Windows just reads from sys.stdin? This would also reduce code duplication,
and all the Read/WriteConsoleW stuff would be gathered together in one
special class.

Regards,
Adam Bartoš


Re: [Python-Dev] PEP 528: Change Windows console encoding to UTF-8

2016-09-03 Thread Adam Bartoš
Paul Moore (p.f.moore at gmail.com) on Fri Sep 2 05:23:04 EDT 2016 wrote

>
> On 2 September 2016 at 03:35, Steve Dower <steve.dower at python.org> wrote:
> > I'd need to test to be sure, but writing an incomplete code point should
> > just truncate to before that point. It may currently raise OSError if that
> > truncated to zero length, as I believe that's not currently distinguished
> > from an error. What behavior would you propose?
>
> For "correct" behaviour, you should retain the unwritten bytes, and
> write them as part of the next call (essentially making the API
> stateful, in the same way that incremental codecs work). I'm pretty
> sure that this could cause actual problems, for example I think invoke
> (https://github.com/pyinvoke/invoke) gets byte streams from
> subprocesses and dumps them direct to stdout in blocks (so could
> easily end up splitting multibyte sequences). It's arguable that it
> should be decoding the bytes from the subprocess and then re-encoding
> them, but that gets us into "guess the encoding used by the
> subprocess" territory.
>
> The problem is that we're not going to simply drop some bad data in
> the common case - it's not so much the dropping of the start of an
> incomplete code point that bothers me, as the encoding error you hit
> at the start of the *next* block of data you send. So people will get
> random, unexplained, encoding errors.
>
> I don't see an easy answer here other than a stateful API.
>
>
Isn't the buffered IO wrapper for this?
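
For reference, the stateful behaviour Paul compares this to is what the standard incremental codecs provide. The sketch below only illustrates that idea on plain byte chunks; the function name decode_chunks is made up, and nothing here touches the console.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, carrying partial sequences across calls."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the stream ended in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

data = "α€\U0001F600".encode("utf-8")          # 2 + 3 + 4 = 9 bytes
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

assert decode_chunks(chunks) == "α€\U0001F600"

# A stateless decode of one chunk fails as soon as a character is split.
try:
    chunks[1].decode("utf-8")
except UnicodeDecodeError:
    print("stateless decode failed on a split sequence")
```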



> > Reads of less than four bytes fail instantly, as in the worst case we need
> > four bytes to represent one Unicode character. This is an unfortunate
> > reality of trying to limit it to one system call - you'll never get a full
> > buffer from a single read, as there is no simple mapping between
> > length-as-utf8 and length-as-utf16 for an arbitrary string.
>
> And here - "read a single byte" is a not uncommon way of getting some
> data. Once again see invoke:
> https://github.com/pyinvoke/invoke/blob/master/invoke/platform.py#L147
>
> used at
> https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L548
>
> I'm not saying that there's an easy answer here, but this *will* break
> code. And actually, it's in violation of the documentation:
> see https://docs.python.org/3/library/io.html#io.RawIOBase.read
>
> """
> read(size=-1)
>
> Read up to size bytes from the object and return them. As a
> convenience, if size is unspecified or -1, readall() is called.
> Otherwise, only one system call is ever made. Fewer than size bytes
> may be returned if the operating system call returns fewer than size
> bytes.
>
> If 0 bytes are returned, and size was not 0, this indicates end of
> file. If the object is in non-blocking mode and no bytes are
> available, None is returned.
> """
>
> You're not allowed to return 0 bytes if the requested size was not 0,
> and you're not at EOF.
>
>

That's why it should rather be signaled by an exception. Even when one
doesn't transcode UTF-16 to UTF-8, reading just one byte is still
impossible, and I would argue that returning zero bytes is incorrect in that
case too. I raise ValueError in win_unicode_console.
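
To make the length mismatch concrete, the sketch below counts UTF-8 bytes and UTF-16 code units per character and shows the kind of size guard described above. The names utf8_utf16_lengths, MIN_READ_SIZE and check_read_size are hypothetical; this is not the code of win_unicode_console or of the PEP 528 patch.

```python
def utf8_utf16_lengths(text):
    """Return (UTF-8 bytes, UTF-16 code units) needed for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le")) // 2

MIN_READ_SIZE = 4  # worst case: one character needs four UTF-8 bytes

def check_read_size(size):
    """Hypothetical guard: reject positive reads that cannot hold one character."""
    if 0 < size < MIN_READ_SIZE:
        raise ValueError("cannot read fewer than %d bytes" % MIN_READ_SIZE)
    return size

for ch in ["a", "α", "€", "\U0001F600"]:
    print(ascii(ch), utf8_utf16_lengths(ch))
```

Raising an exception keeps the documented meaning of an empty result (end of file) intact, at the cost of breaking callers that read one byte at a time.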


Adam Bartoš


Re: [Python-Dev] Python semantic: Is it ok to replace not x == y with x != y? (no)

2015-12-15 Thread Adam Bartoš
Hello,

the comparisons >=, <=, >, < cannot be optimized this way. Not every order
is a total order. For example, the sets a = {1, 2} and b = {2, 3} are
incomparable (in the sense that both a >= b and a <= b are False), and this
is not a pathological case.
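
The set example can be checked directly. The snippet uses only built-in sets and shows that rewriting not a >= b as a < b would change the result.

```python
a = {1, 2}
b = {2, 3}

# Neither set contains the other, so both comparisons are False.
assert not (a >= b)
assert not (a <= b)

# Rewriting `not a >= b` as `a < b` would therefore change the result.
assert (not a >= b) is True
assert (a < b) is False
```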

Regards, Adam Bartoš


Re: [Python-Dev] Improving the reading part of REPL

2015-11-20 Thread Adam Bartoš
Another issue with the current implementation is
http://bugs.python.org/issue24829. Even if I fix my Python environment with
win_unicode_console so that >>> "α" really results in "α" rather than "?",
the feature vanishes when I try to redirect stdout.



[Python-Dev] Improving the reading part of REPL

2015-11-19 Thread Adam Bartoš
It seems that there will be some refactoring of the tokenizer code.
Regarding this, I'd like to recall my proposal on readline hooks. It would
be nice if the char* based PyOS_Readline API was replaced by a Python str
based hook customizable by Python code. I propose adding a function
sys.readlinehook accepting an optional prompt and returning a line read
interactively from a user. There would also be sys.__readlinehook__
containing the original value of sys.readlinehook (similarly to
sys.(__)displayhook(__), sys.(__)excepthook(__) and
sys.(__)std(in/out/err)(__)).

Currently, the input is read from C stdin even if sys.stdin is changed (see
http://bugs.python.org/issue17620). This complicates fixing
http://bugs.python.org/issue1602 – the standard sys.std* streams are not
capable of communicating in Unicode with the Windows console, and replacing
the streams with custom ones is not enough – one also has to install a custom
readline hook, which is currently complicated. And even after installing a
custom readline hook one finds out that the Python tokenizer cannot handle
UTF-16, so one has to wrap the custom stream objects just to let their
encoding attribute have a different value, because readlinehook currently
returns char* rather than a Python string. For more details see the
documentation of my package: https://github.com/Drekin/win-unicode-console.

The pyreadline package also sets up a custom readline hook, so it would
benefit if doing so were easier. Moreover, the two consumers of the
PyOS_Readline API – the input function and the tokenizer – assume a different
encoding of the bytes returned by the readlinehook. Effectively, one assumes
sys.stdout.encoding and the other sys.stdin.encoding, so if these two are
different, there is no way to implement a correct readline hook.

If sys.readlinehook was added, the built-in input function would be just a
thin wrapper over sys.readlinehook removing the newline character and
turning no input into EOFError. I think that the best default value for
sys.readlinehook on Windows would be stdio_readline – just write the prompt
to sys.stdout and read a line from sys.stdin. On Linux, the default
implementation would call GNU readline if it is available and sys.stdin and
sys.stdout are standard TTYs (the check present in the current
implementation of the input function), and it would call stdio_readline
otherwise.
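
The proposal above can be sketched in pure Python. Note that sys.readlinehook does not exist; the functions below only model the proposed behaviour, and the name input_via_hook and the stdin/stdout parameters are assumptions added so the sketch can run against in-memory streams.

```python
import functools
import io
import sys

def stdio_readline(prompt="", stdin=None, stdout=None):
    """Write the prompt to stdout and read one line from stdin, both as str."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    stdout.write(prompt)
    stdout.flush()
    return stdin.readline()

def input_via_hook(prompt="", readlinehook=stdio_readline):
    """Model of input() as a thin wrapper over a str-based readline hook."""
    line = readlinehook(prompt)
    if not line:                      # no input at all means end of file
        raise EOFError
    return line[:-1] if line.endswith("\n") else line

# Usage with in-memory streams standing in for the console.
out = io.StringIO()
hook = functools.partial(stdio_readline, stdin=io.StringIO("α\n"), stdout=out)
assert input_via_hook(">>> ", hook) == "α"
assert out.getvalue() == ">>> "
```

Because the hook returns str, no encoding has to be agreed upon between the hook and its two consumers.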

Regards, Adam Bartoš


Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-10 Thread Adam Bartoš
Glenn Linderman wrote:
> Is this going to get released in 3.5, I hope?  Python 3 is pretty
> limited without some solution for Unicode on the console... probably the
> biggest deficiency I have found in Python 3, since its introduction. It
> has great Unicode support for files and processing, which convinced me
> to switch from Perl, and I like so much else about it, that I can hardly
> code in Perl any more (I still support a few Perl programs, but have
> ported most of them to Python).

I'd love to see it included in 3.5, but I doubt that will happen. For one
thing, it's only two weeks till beta 1, which is feature freeze. And
mainly, my package mostly hacks into the existing Python environment. A
proper implementation would need some changes in Python that someone would
have to make. See for example my proposal
http://bugs.python.org/issue17620#msg234439. I'm not competent to write a
patch myself and I have also received no feedback on the proposed idea. On
the other hand, using the package is good enough for me, so I didn't draw
further attention to the proposal.


Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-09 Thread Adam Bartoš
I already have a solution in Python 3 (see
https://github.com/Drekin/win-unicode-console,
https://pypi.python.org/pypi/win_unicode_console); I was just considering
adding support for Python 2 as well. I think I have a working example in
Python 2 using ctypes.

On Thu, May 7, 2015 at 9:23 PM, Martin v. Löwis mar...@v.loewis.de
wrote:

> On 02.05.15 at 21:57, Adam Bartoš wrote:
> > Even if sys.stdin contained a file-like object with proper encoding
> > attribute, it wouldn't work since sys.stdin has to be instance of type
> > 'file'. So the question is, whether it is possible to make a file instance
> > in Python that is also customizable so it may call my code. For the first
> > thing, how to change the value of encoding attribute of a file object.
>
> If, by "in Python", you mean both "in pure Python" and "in Python 2",
> then the answer is no. If you can add arbitrary C code, then you might
> be able to hack your C library's stdio implementation to delegate fread
> calls to your code.
>
> I recommend to use Python 3 instead.
>
> Regards,
> Martin




Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-02 Thread Adam Bartoš
I think I have found out where the problem is. In fact, the encoding of the
interactive input is determined by sys.stdin.encoding, but only in the case
that it is a file object (see
https://hg.python.org/cpython/file/d356e68de236/Parser/tokenizer.c#l890 and
the implementation of tok_stdin_decode). For example, by default on my
system sys.stdin has encoding cp852.

>>> u'á'
u'\xe1' # correct
>>> import sys; sys.stdin = foo
>>> u'á'
u'\xa0' # incorrect

Even if sys.stdin contained a file-like object with a proper encoding
attribute, it wouldn't work, since sys.stdin has to be an instance of type
'file'. So the question is whether it is possible to make a file instance
in Python that is also customizable so that it may call my code. To begin
with, how can the value of the encoding attribute of a file object be changed?
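
The two results in the transcript are consistent with the same console byte being decoded in two different ways. The following Python 3 sketch only reproduces the arithmetic; it assumes, as the output above suggests, that the incorrect case treats each byte as one code point (as Latin-1 does).

```python
console_bytes = "á".encode("cp852")   # what a cp852 console delivers
assert console_bytes == b"\xa0"

# Decoded with the console encoding, the character survives.
assert console_bytes.decode("cp852") == "\xe1"

# Treated as Latin-1 (one byte, one code point), it turns into U+00A0.
assert console_bytes.decode("latin-1") == "\xa0"
```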


Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-01 Thread Adam Bartoš
On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull step...@xemacs.org
wrote:

> Adam Bartoš writes:
>
> > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
> > sys.std* streams are created with utf-8 encoding (which doesn't
> > help on Windows since they still don't use ReadConsoleW and
> > WriteConsoleW to communicate with the terminal) and after changing
> > the sys.std* streams to the fixed ones and setting readline hook,
> > it still doesn't work,
>
> I don't see why you would expect it to work: either your code is
> bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
> matter, or you're feeding already decoded text *as UTF-8* to your
> module which evidently expects something else (UTF-16LE?).


I'll describe my picture of the situation, which might be terribly wrong.
On Linux, in a typical situation, we have a UTF-8 terminal,
PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants input
from a user, the tokenizer calls PyOS_Readline, which calls GNU readline.
The user is prompted with >>> , can use autocompletion and everything during
the input, and enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as
char* or something), which is the UTF-8 encoded input from the user. The
tokenizer, parser, and evaluator process the input and the result is
u'\u03b1', which is printed as an answer.

In my case I install custom sys.std* objects and a custom readline hook.
Again, the tokenizer calls PyOS_Readline, which calls my readline hook,
which calls sys.stdin.readline(), which returns the Unicode string the user
entered (it was actually decoded from UTF-16-LE bytes). My readline hook
encodes this string to UTF-8 and returns it. So the situation is the same.
The tokenizer gets b"u'\xce\xb1'" as before, but now it results in
u'\xce\xb1'.

Why is the result different? I thought that in the first case
PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I
thought that PYTHONIOENCODING=utf-8 is the thing that also sets
PyCF_SOURCE_IS_UTF8.



> > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
>
> I don't think that flag does what you think it does.  AFAICT from
> looking at the source, that flag gets unconditionally set in the
> execution context for compile, eval, and exec, and it is checked in
> the parser when creating an AST node.  So it looks to me like it
> asserts that the *internal* representation of the program is UTF-8
> *after* transforming the input to an internal representation (doing
> charset decoding, removing comments and line continuations, etc).


I thought it might do what I want because of the behaviour of eval. I
thought that the PyUnicode_AsUTF8String call in eval just encodes the
passed unicode to UTF-8, so the situation looks as follows:
eval(u"u'\u03b1'") -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
eval(u"u'\u03b1'".encode('utf-8')) -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8
not set) -> u'\xce\xb1'
But of course, this picture of mine might be wrong.
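
The two outcomes can be modelled in Python 3 by decoding the same source bytes in two ways before evaluating the literal. This is only an illustration of the picture described above, not of how Python 2 handles PyCF_SOURCE_IS_UTF8 internally; ast.literal_eval is used in place of eval because only a string literal is evaluated.

```python
import ast

source_bytes = b"u'\xce\xb1'"  # UTF-8 bytes of the source text u'α'

# Decoder told that the source is UTF-8: one character, U+03B1.
as_utf8 = ast.literal_eval(source_bytes.decode("utf-8"))
assert as_utf8 == "\u03b1"

# Each byte taken as its own code point (Latin-1): two characters.
as_latin1 = ast.literal_eval(source_bytes.decode("latin-1"))
assert as_latin1 == "\xce\xb1"
```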


> > Well, the received text comes from sys.stdin and its encoding is
> > known.
>
> How?  You keep asserting this.  *You* know, but how are you passing
> that information to *the Python interpreter*?  Guido may have a time
> machine, but nobody claims the Python interpreter is telepathic.


I thought that the Python interpreter knows the input comes from sys.stdin
at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject
the encoding for the tokenizer is inferred from sys.stdin.encoding. But
this is actually the case only in Python 3. So I was wrong.


> > Yes. In the latter case, eval has no idea how the bytes given are
> > encoded.
>
> Eval *never* knows how bytes are encoded, not even implicitly.  That's
> one of the important reasons why Python 3 was necessary.  I think you
> know that, but you don't write like you understand the implications
> for your current work, which makes it hard to communicate.


Yes, eval never knows how bytes are encoded. But I meant it in comparison
with the first case where a Unicode string was passed.


Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-30 Thread Adam Bartoš
> does this not work for you?
>
> from __future__ import unicode_literals

No, with unicode_literals I just don't have to use the u'' prefix, but the
wrong interpretation persists.


On Thu, Apr 30, 2015 at 3:03 AM, Stephen J. Turnbull step...@xemacs.org
wrote:


> IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in
> the environment does what you want.


Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std*
streams are created with utf-8 encoding (which doesn't help on Windows
since they still don't use ReadConsoleW and WriteConsoleW to communicate
with the terminal) and after changing the sys.std* streams to the fixed
ones and setting readline hook, it still doesn't work, so presumably the
PyCF_SOURCE_IS_UTF8 is still not set.



> Regarding your environment, the repeated use of "custom" is a red
> flag.  Unless you bundle your whole environment with the code you
> distribute, Python can know nothing about that.  In general, Python
> doesn't know what encoding it is receiving text in.


Well, the received text comes from sys.stdin and its encoding is known.
Ideally, Python would receive the text as a Unicode string object so there
would be no problem with encoding (see
http://bugs.python.org/issue17620#msg234439 ).


> If you *do* know, you can set PyCF_SOURCE_IS_UTF8.  So if you know
> that all of your users will have your custom stdio and readline hooks
> installed (AFAICS, they can't use IDLE or IPython!), then you can
> bundle Python built with the flag set, or perhaps you can do the
> decoding in your custom stdio module.


The custom stdio streams and readline hooks are set at runtime by code in
sitecustomize. This does not affect IDLE and it is compatible with IPython. I
would also like to set PyCF_SOURCE_IS_UTF8 at runtime from Python, e.g. via
ctypes. But this may be impossible.



> Note that even if you have a UTF-8 input source, some users are likely
> to be surprised because IIRC Python doesn't canonicalize in its
> codecs; that is left for higher-level libraries.  Linux UTF-8 is
> usually NFC normalized, while Mac UTF-8 is NFD normalized.
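
The normalization difference mentioned in the quoted paragraph can be seen with the standard unicodedata module. The snippet only shows that the same visible character has two distinct code point sequences; it does not claim anything about what a particular terminal actually sends.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "a\u0301")    # single code point U+00E1
decomposed = unicodedata.normalize("NFD", "\xe1")     # 'a' + combining acute

assert composed == "\xe1" and len(composed) == 1
assert decomposed == "a\u0301" and len(decomposed) == 2
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```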


Actually, I have a UTF-16-LE source, but that is not important since it's
decoded to a Python Unicode string object. I have this Unicode string and I
need to return it from the readline hook, but I don't know how to communicate
it to the caller – the tokenizer – so it is interpreted correctly. Note that
the following works:

>>> eval(raw_input('~~ '))
~~ u'α'
u'\u03b1'

Unfortunately, the REPL works differently than eval/exec on raw_input. It
seems that the only option is to bypass the REPL with a custom REPL (e.g.
based on code.InteractiveConsole). However, wrapping up the execution of a
script, so that the custom REPL is invoked at the right place, is
complicated.
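
A custom REPL of that kind can be sketched with the standard code module by overriding raw_input so that lines come from an arbitrary str-based stream. The sketch is written for Python 3; the class name StreamConsole is hypothetical, and an in-memory stream stands in for a Unicode-capable console stream.

```python
import code
import io

class StreamConsole(code.InteractiveConsole):
    """Interactive console that reads its input lines from a given text stream."""

    def __init__(self, stdin, locals=None):
        super().__init__(locals)
        self._stdin = stdin

    def raw_input(self, prompt=""):
        # The prompt is ignored here; a real console would write it out first.
        line = self._stdin.readline()
        if not line:
            raise EOFError        # tells interact() to leave the loop
        return line.rstrip("\n")

ns = {}
console = StreamConsole(io.StringIO("x = u'α'\n"), ns)
console.interact(banner="", exitmsg="")
assert ns["x"] == "\u03b1"
```

Since the lines reach the compiler as str, no encoding flag is involved at all in this path.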


> > On 29 Apr 2015 10:36, Adam Bartoš dre...@gmail.com wrote:
> > > Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
> > > u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
>
> Just to be clear, you accept those results as correct, right?


Yes. In the latter case, eval has no idea how the bytes given are encoded.


Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Adam Bartoš
This situation is a bit different from coding cookies. They are used when
we have bytes from a source file, but we don't know its encoding. During
an interactive session the tokenizer always knows the encoding of the bytes. I
would think that in the case of an interactive session PyCF_SOURCE_IS_UTF8
should always be set so the bytes containing encoded non-ASCII characters
are interpreted correctly. Why am I talking about PyCF_SOURCE_IS_UTF8?
eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) ->
u'\xce\xb1'. I understand that in the second case eval has no idea how
the given bytes are encoded. But the first case is actually implemented by
encoding to utf-8 and setting PyCF_SOURCE_IS_UTF8. That's why I'm talking
about the flag.

Regards, Drekin

On Wed, Apr 29, 2015 at 9:25 AM, Nick Coghlan ncogh...@gmail.com wrote:

> On 29 April 2015 at 06:20, Adam Bartoš dre...@gmail.com wrote:
> > Hello,
> >
> > is it possible to somehow tell Python 2.7 to compile a code entered in the
> > interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering
> > adding support for Python 2 in my package
> > (https://github.com/Drekin/win-unicode-console) and I have run into the fact
> > that when u'α' is entered in the interactive session, it results in
> > u'\xce\xb1' rather than u'\u03b1'. As this seems to be a highly specialized
> > question, I'm asking it here.
>
> As far as I am aware, we don't have the equivalent of a coding
> cookie for the interactive interpreter, so if anyone else knows how
> to do it, I'll be learning something too :)
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia



Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Adam Bartoš
Yes, that works for eval. But I want it for code entered during an
interactive session.

>>> u'α'
u'\xce\xb1'

The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows
it's utf-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of
how eval works, I believe that it would work correctly if
PyCF_SOURCE_IS_UTF8 was set, but it is not. That is why I'm asking if there
is a way to set it. Also, my naive thought is that it should always be set
in the case of an interactive session.


On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner victor.stin...@gmail.com
wrote:

> On 29 Apr 2015 10:36, Adam Bartoš dre...@gmail.com wrote:
> > Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
> > u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.
>
> There is a simple option to get this flag: call eval() with unicode, not
> with encoded bytes.
>
> Victor



Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Adam Bartoš
I am on Windows and my terminal isn't utf-8 at the beginning, but I install
custom sys.std* objects at runtime and I also install a custom readline hook,
so the interactive loop gets the input from my stream objects via
PyOS_Readline. So when I enter u'α', the tokenizer gets b"u'\xce\xb1'",
which is the string encoded in utf-8, and sys.stdin.encoding == 'utf-8'.
However, the input is then interpreted as u'\xce\xb1' instead of u'\u03b1'.

On Wed, Apr 29, 2015 at 6:40 PM, Guido van Rossum gu...@python.org wrote:

> I suspect the interactive session is *not* always in UTF8. It probably
> depends on the keyboard mapping of your terminal emulator. I imagine in
> Windows it's the current code page.

> --
> --Guido van Rossum (python.org/~guido)



[Python-Dev] Unicode literals in Python 2.7

2015-04-28 Thread Adam Bartoš
Hello,

is it possible to somehow tell Python 2.7 to compile code entered in the
interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering
adding support for Python 2 in my package (
https://github.com/Drekin/win-unicode-console) and I have run into the fact
that when u'α' is entered in the interactive session, it results in
u'\xce\xb1' rather than u'\u03b1'. As this seems to be a highly specialized
question, I'm asking it here.

Regards, Drekin