[issue12486] tokenize module should have a unicode API

2018-06-05 Thread Thomas Kluyver


Thomas Kluyver added the comment:

Thanks Carol :-)

--




[issue12486] tokenize module should have a unicode API

2018-06-05 Thread Carol Willing


Change by Carol Willing:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed




[issue12486] tokenize module should have a unicode API

2018-06-05 Thread Carol Willing


Carol Willing added the comment:


New changeset c56b17bd8c7a3fd03859822246633d2c9586f8bd by Carol Willing
(Thomas Kluyver) in branch 'master':
bpo-12486: Document tokenize.generate_tokens() as public API (#6957)
https://github.com/python/cpython/commit/c56b17bd8c7a3fd03859822246633d2c9586f8bd
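
For reference, a minimal sketch of the now-documented API (assuming
generate_tokens() keeps its current signature: a readline callable returning
str, yielding TokenInfo tuples):

    import io
    import tokenize

    # Tokenize a unicode string directly; no encoding detection step.
    source = "x = 1 + 2\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tok.type, tok.string)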


--
nosy: +willingc




[issue12486] tokenize module should have a unicode API

2018-05-28 Thread Thomas Kluyver


Thomas Kluyver added the comment:

The tests on PR #6957 are passing now, if anyone has time to have a look. :-)

--




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Thomas Kluyver

Thomas Kluyver added the comment:

Thanks - I had forgotten it; just fixed it now.

--




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Martin Panter

Martin Panter added the comment:

Don’t forget about updating __all__.
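
For illustration, the kind of change meant here, as a sketch (the exact
contents of __all__ in Lib/tokenize.py may differ):

    # Lib/tokenize.py (sketch): export the newly public name as well.
    __all__ = token.__all__ + ["tokenize", "generate_tokens",
                               "detect_encoding", "untokenize", "TokenInfo"]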

--




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Thomas Kluyver

Thomas Kluyver added the comment:

I agree, it's not a good design, but it's what's already there; I just want to 
ensure that it won't be removed without a deprecation cycle. My PR makes no 
changes to behaviour, only to documentation and tests.

This and issue 9969 have both been around for several years. A new tokenize API
is clearly not at the top of anyone's priority list - and that's fine. I'd
rather have *some* unicode API today than a promise of a nice unicode API in
the future. And it doesn't preclude adding a better API later; it just means
that the existing API would have to have a deprecation cycle.

--




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

My concern is that we will have two functions with dissimilar names
(tokenize() and generate_tokens()) that do virtually the same thing but accept
different types of input (bytes or str), and a single function untokenize()
that produces a different type of result depending on the value of its input.
This doesn't look like a good design to me.
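
A minimal sketch of the asymmetry described above (assuming current CPython
behaviour):

    import io
    import tokenize

    # str input via generate_tokens(): no ENCODING token; untokenize() -> str
    toks = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
    print(type(tokenize.untokenize(toks)))    # <class 'str'>

    # bytes input via tokenize(): leading ENCODING token; untokenize() -> bytes
    btoks = list(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))
    print(type(tokenize.untokenize(btoks)))   # <class 'bytes'>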

--




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Thomas Kluyver

Thomas Kluyver added the comment:

I wouldn't say it's a good name, but I think the advantage of documenting an 
existing name outweighs that. We can start (or continue) using 
generate_tokens() right away, whereas a new name presumably wouldn't be 
available until Python 3.8 comes out. And we usually don't require a new Python 
version until a couple of years after it is released.

If we want to add better names or clearer APIs on top of this, great. But I 
don't want that discussion to hold up the simple step of committing to keep the 
existing API.

--




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Serhiy Storchaka

Change by Serhiy Storchaka:


--
nosy: +barry, mark.dickinson, michael.foord, trent
versions: +Python 3.8 -Python 3.6




[issue12486] tokenize module should have a unicode API

2018-05-18 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The old generate_tokens() was renamed to tokenize() in issue719888 because the 
latter is a better name. Is "generate_tokens" considered a good name now?

--




[issue12486] tokenize module should have a unicode API

2018-05-17 Thread Thomas Kluyver

Change by Thomas Kluyver:


--
pull_requests: +6616




[issue12486] tokenize module should have a unicode API

2018-05-17 Thread Matthias Bussonnier

Matthias Bussonnier added the comment:

> Why not just bless the existing generate_tokens() function as a public API, 

Yes please, or just make the private `_tokenize` public under another name.
The `tokenize.tokenize` method tries to magically detect the encoding, which
may be unnecessary.
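
A small sketch of the magic in question (assuming current stdlib behaviour;
note the cookie name is normalized):

    import io
    import tokenize

    # tokenize() sniffs the coding cookie from the byte stream it is given...
    src = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
    first = next(tokenize.tokenize(io.BytesIO(src).readline))
    print(first.type == tokenize.ENCODING, first.string)  # True iso-8859-1

    # ...and detect_encoding() exposes the same sniffing directly.
    encoding, lines = tokenize.detect_encoding(io.BytesIO(src).readline)
    print(encoding)  # iso-8859-1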

--
nosy: +mbussonn




[issue12486] tokenize module should have a unicode API

2018-03-11 Thread Thomas Kluyver

Thomas Kluyver added the comment:

> Why not just bless the existing generate_tokens() function as a public API

We're actually using generate_tokens() from IPython - we wanted a way to 
tokenize unicode strings, and although it's undocumented, it's been there for a 
number of releases and does what we want. So +1 to promoting it to a public API.

In fact, at the moment, IPython has its own copy of tokenize to fix one or two 
old issues. I'm trying to get rid of that and use the stdlib module again, 
which is how I came to notice that we're using an undocumented API.

--
nosy: +takluyver




[issue12486] tokenize module should have a unicode API

2015-10-05 Thread Martin Panter

Martin Panter added the comment:

I didn’t notice that this dual untokenize() behaviour already existed. Taking 
that into account weakens my argument for having separate text and bytes 
tokenize() functions.

--




[issue12486] tokenize module should have a unicode API

2015-10-05 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Thank you for your review, Martin.

Here is a rebased patch that addresses Martin's comments.

I agree that having untokenize() change its output type depending on the
ENCODING token is bad design and we should change this. But that is perhaps
another issue.

--
versions: +Python 3.6 -Python 3.4




[issue12486] tokenize module should have a unicode API

2015-10-05 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


Added file: http://bugs.python.org/file40679/tokenize_str_2.diff




[issue12486] tokenize module should have a unicode API

2015-10-04 Thread Martin Panter

Martin Panter added the comment:

I agree it would be very useful to be able to tokenize arbitrary text without
worrying about encoding tokens. I left some suggestions for the documentation
changes. Also, some test cases for it would be good.

However, I wonder if a separate function would be better for the text-mode
tokenization. It would make it clearer when an ENCODING token is expected and
when it isn't, and would avoid any confusion about what happens when
readline() returns a byte string one time and a text string another time.
Also, having untokenize() change its output type depending on the ENCODING
token seems like bad design to me.

Why not just bless the existing generate_tokens() function as a public API, 
perhaps renaming it to something clearer like tokenize_text() or 
tokenize_text_lines() at the same time?
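
For illustration, a sketch of that renaming idea (tokenize_text() is
hypothetical, not an existing stdlib name):

    # Hypothetical: inside Lib/tokenize.py, a clearer public alias
    # for the existing text-mode tokenizer.
    def tokenize_text(readline):
        return generate_tokens(readline)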

--
nosy: +martin.panter
stage:  -> patch review




[issue12486] tokenize module should have a unicode API

2012-12-28 Thread Meador Inge

Meador Inge added the comment:

See also issue9969.

--
nosy: +meador.inge




[issue12486] tokenize module should have a unicode API

2012-10-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The patch to allow tokenize() to accept a string is very simple, only 4
lines. But it requires a lot of documentation changes.

Then we can get rid of the undocumented generate_tokens(). Note that the
stdlib and tools use only generate_tokens(); nothing uses tokenize().

Here is a preliminary patch. I will be thankful for help with the
documentation and for the discussion.

Of course, it would be better if tokenize() worked with the iterator
protocol.
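
For illustration, a hypothetical adapter (tokenize_iter() is not a real name)
showing how iterator input could sit on top of the current readline-based
API:

    import tokenize

    def tokenize_iter(lines):
        # Adapt any iterable of str lines to the readline protocol that
        # generate_tokens() expects; '' signals end of input.
        it = iter(lines)
        return tokenize.generate_tokens(lambda: next(it, ''))

    for tok in tokenize_iter(["x = 1\n", "y = 2\n"]):
        print(tok)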

--
keywords: +patch
nosy: +serhiy.storchaka
versions: +Python 3.4 -Python 3.3
Added file: http://bugs.python.org/file27587/tokenize_str.diff




[issue12486] tokenize module should have a unicode API

2012-10-13 Thread Eric Snow

Changes by Eric Snow ericsnowcurren...@gmail.com:





[issue12486] tokenize module should have a unicode API

2011-07-09 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

The compiler has a PyCF_SOURCE_IS_UTF8 flag: see the compile() builtin. The
parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE.

Patching tokenize to support Unicode is simple: use the PyCF_SOURCE_IS_UTF8
and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8.

Rewriting the parser to work directly on Unicode is much more complex, and I
don't think we need that.
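
At the Python level the same asymmetry is already visible in the compile()
builtin, which accepts both decoded text and bytes (a minimal sketch; the C
flags themselves are not exposed to Python code):

    # str source is compiled directly (internally PyCF_SOURCE_IS_UTF8 is set);
    # bytes source goes through coding-cookie detection first.
    compile("x = 'caf\u00e9'\n", "<text>", "exec")
    compile(b"# coding: latin-1\nx = 'caf\xe9'\n", "<bytes>", "exec")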

--




[issue12486] tokenize module should have a unicode API

2011-07-09 Thread Eric Snow

Changes by Eric Snow ericsnowcurren...@gmail.com:


--
nosy: +ericsnow




[issue12486] tokenize module should have a unicode API

2011-07-08 Thread Petri Lehtinen

Changes by Petri Lehtinen pe...@digip.org:


--
nosy: +petri.lehtinen




[issue12486] tokenize module should have a unicode API

2011-07-08 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

Hmm. Python 3 code is unicode. Python reads program text as Unicode code 
points. The tokenize module purports to provide a lexical scanner for Python 
source code. But it seems not to do that. Instead it provides a scanner for 
Python code encoded as bytes, which is something different. So this is at least 
a doc update issue (which affects 2.7/3.2 also). Another doc issue is given 
below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol,
which probably dates to 1.0 and which expects the source to be a file or
file-like object. The more recent iterator protocol would let the source be
anything. A modern tokenize function should accept an iterable of strings.
This would include, but not be limited to, a file opened in text mode.

A related problem is that 'tokenize' is a convenience function that does 
several things bundled together.
1. Read lines as bytes from a file-like source.
2. Detect encoding.
3. Decode lines to strings.
4. Actually tokenize the strings to tokens.

I understand this feature request to be a request that function 4, the
actual Python 3 code tokenizer, be unbundled and exposed to users. I agree
with this request. Any user that starts with actual Py3 code would benefit.
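
As a sketch, the four steps can already be seen as separable pieces of
today's module (assuming detect_encoding() and generate_tokens() keep their
current signatures):

    import io
    import tokenize

    raw = io.BytesIO(b"# coding: utf-8\nx = 1\n")
    encoding, _ = tokenize.detect_encoding(raw.readline)  # steps 1-2
    raw.seek(0)
    text = io.TextIOWrapper(raw, encoding=encoding)       # step 3
    for tok in tokenize.generate_tokens(text.readline):   # step 4
        print(tok)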

(Compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize()
says "Converts tokens back into Python source code. ... The reconstructed
script is returned as a single string." That would be nice if true, but I am
going to guess it is not, as the entry continues "It returns bytes, encoded
using the ENCODING token". In Py3, string != bytes, so this seems an
incomplete doc conversion from Py2.

--
nosy: +terry.reedy




[issue12486] tokenize module should have a unicode API

2011-07-04 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
nosy: +eric.araujo, haypo
type:  -> feature request
versions: +Python 3.3




[issue12486] tokenize module should have a unicode API

2011-07-03 Thread Devin Jeanpierre

New submission from Devin Jeanpierre jeanpierr...@gmail.com:

tokenize only deals with bytes. Users might want to deal with unicode source
(for example, if Python source is embedded into a document with an
already-known encoding).

The naive approach might be something like:

  def my_readline():
      return my_oldreadline().encode('utf-8')

But this doesn't work for Python source that declares its encoding, which
might be something other than utf-8. The only safe ways are to either
manually add a coding line yourself (there are lots of ways; I picked a dumb
one):

  def my_readline_safe(was_read=[]):
      if not was_read:
          was_read.append(True)
          return b'# coding: utf-8'
      return my_oldreadline().encode('utf-8')

  tokenstream = tokenize.tokenize(my_readline_safe)

Or to use the same my_readline as before (no added coding line), but instead of 
passing it to tokenize.tokenize, you could pass it to the undocumented 
_tokenize function:

tokenstream = tokenize._tokenize(my_readline, 'utf-8')

Or, ideally, you'd just pass the original readline that produces unicode into a 
utokenize function:

tokenstream = tokenize.utokenize(my_oldreadline)
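
As it happens, the undocumented generate_tokens() already behaves like this
hypothetical utokenize() (a sketch, assuming my_oldreadline returns str lines
and '' at EOF):

    tokenstream = tokenize.generate_tokens(my_oldreadline)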

--
components: Library (Lib)
messages: 139733
nosy: Devin Jeanpierre
priority: normal
severity: normal
status: open
title: tokenize module should have a unicode API
