[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2019-03-22 Thread Terry J. Reedy


Change by Terry J. Reedy :


--
assignee:  -> terry.reedy
components: +IDLE

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2015-12-15 Thread Tal Einat

Tal Einat added the comment:

It turns out that staying with str.translate in PyParse payed off! An 
optimization in 3.5 (issue21118) has made it much much faster on ASCII-only 
inputs, which are the vast majority in this case.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-16 Thread Martin v . Löwis

Martin v. Löwis added the comment:

LGTM

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-16 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 8b3f7aecdf85 by Tal Einat in branch '3.4':
Issue #21765: Add support for non-ascii identifiers to HyperParser
http://hg.python.org/cpython/rev/8b3f7aecdf85

New changeset 73a8c614af4d by Tal Einat in branch 'default':
Issue #21765: Add support for non-ascii identifiers to HyperParser
http://hg.python.org/cpython/rev/73a8c614af4d

--
nosy: +python-dev

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-16 Thread Tal Einat

Tal Einat added the comment:

Fix committed to 3.4 and merged to default. (My first commit!)

Not back-porting this to 2.7 despite PEP 434, because support for non-ASCII 
identifiers only exists in 3.x.

Close this issue as fixed!

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-16 Thread Terry J. Reedy

Changes by Terry J. Reedy tjre...@udel.edu:


--
resolution:  - fixed
stage: test needed - resolved
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-15 Thread Tal Einat

Tal Einat added the comment:

I'm attaching a patch which really fixes this issue, along with additional 
tests for idlelib.HyperParser.

I did indeed have to fix PyParse as well. I got it working with re.subn() as 
Martin suggested, but the performance was much worse (between 100x and 1000x 
slower according to my timings). So I came up with the clever solution of 
passing a defaultdict to str.translate(), with the default value being 
ord('x'). That works as expected and is much faster than the regexp.

Finally, since the defaultdict is kept around as long as IDLE is running, I 
decided to avoid having it grow continually and consume memory unnecessarily. 
So I wrote a simple Mapping class, which wraps a normal dict and uses a custom 
default value instead of None, ord('x') in this case. Works like a charm :)

I haven't written tests for PyParse since it currently doesn't have any tests 
at all. But from the HyperParser tests and from some manual testing, it seems 
to be working as expected.

--
Added file: 
http://bugs.python.org/file35959/taleinat.20140716.IDLE_HyperParser_unicode_ids.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-15 Thread Tal Einat

Changes by Tal Einat talei...@gmail.com:


Removed file: 
http://bugs.python.org/file35959/taleinat.20140716.IDLE_HyperParser_unicode_ids.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-15 Thread Tal Einat

Changes by Tal Einat talei...@gmail.com:


Added file: 
http://bugs.python.org/file35960/taleinat.20140716.IDLE_HyperParser_unicode_ids_v2.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-11 Thread Tal Einat

Tal Einat added the comment:

If you think ColorDelegator and UndoDelegator should be fixed as well, I'd be 
happy to take a look, but we should open a separate tracker issue for it.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-11 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I did not open another issue yet because I did not want to split the general 
discussion of parsing 3.x unicode-based python. We might also need another 
issue for idlelib/PyParse, and that might need to come first.

What I think is that Idle should have at 1 definition of 'identifier' and not a 
least 3, in 3 separate places ;-). Hyperparser seems like the most appropriate 
place if there is to be more than str.isidentifier.

That is not to say that equivalents versions might be needed for different 
uses. ColorDelegator, as it is, needs an re version to be combined with other 
regexes. I have not looked yet at what UndoDelegator does with its character 
classification function and whether str.isidentifier could be used.  
Hyperparsar seems the hardest since it parses backwards.

I do not consider this issue to be the highest priority, at the moment, but we 
have collected enough info with help from Ezio and Martin to do some 
experiments.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-10 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I just noticed that ColorDelegator has
 idprog = re.compile(r\s+(\w+), re.S)
which will recognize unicode 'words', if not exactly Python 'identifiers'.

However, UndoDelegator has 
alphanumeric = string.ascii_letters + string.digits + _
which is the same as in Hyperparser. It is used in

def classify(self, c):
if c in self.alphanumeric:
return alphanumeric
if c == \n:
return newline
return punctuation
and probably does not do what we really want.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-07 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Two observations:
1. This issue is only about identifiers. So processing of string literals is 
technically out of scope.
2. I'd suggest to replace .translate with regular expressions:

py re.sub('[^(){}\[\]]','','foo(b[a]{r}≠)')
'([]{})'

I'm sure people interested in performance will want to time this approach.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-07 Thread Tal Einat

Tal Einat added the comment:

@Martin:

 1. This issue is only about identifiers. So processing of
 string literals is technically out of scope.

I added a test with a non-ASCII string literal only for good measure, while I 
was already adding a test with a non-ASCII identifier. The patch doesn't change 
anything in that regard. Supporting Unicode string literals could be considered 
a separate issue, though the root cause of the problems is the same, and fixing 
it properly will fix both issues.

 2. I'd suggest to replace .translate with regular expressions:

Indeed, something along those lines should work well, and probably would be 
fast enough.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-07-06 Thread Tal Einat

Tal Einat added the comment:

Indeed, I seem to have been misinterpreting the grammar, despite taking care 
and reading it several times. This strengthens my opinion that we should use 
str.isidentifier() rather than attempt to correctly re-implement just the parts 
that we need.

Attached is a patch which fixes HyperParser._eat_identifier(), to the extent of 
my testing (tests included).

When non-ASCII characters are encountered, this patch uses Terry's suggestion 
of checking for valid identifier characters using ('a' + 
string_part).isidentifier(). It also employs his suggestion of how to avoid 
executing this check at every index, by skipping 4 characters at a time.

However, even with this fix, HyperParser.get_expression() still fails with 
non-ASCII Unicode strings. This is because it uses PyParse, which doesn't 
support Unicode! For example, it apparently replaces all non-ASCII characters 
with 'x'. I've added (in this patch) a few tests for this, which currently fail.

FWIW, PyParse includes a comment to this effect[1]:

quote
The parse functions have no idea what to do with Unicode, so
replace all Unicode characters with x.  This is safe
so long as the only characters germane to parsing the structure
of Python are 7-bit ASCII.  It's *necessary* because Unicode
strings don't have a .translate() method that supports
deletechars.
/quote

Properly resolving this issue will apparently require fixing PyParse to 
properly support Unicode.

.. [1]: 
http://hg.python.org/cpython/file/d25ae22cc992/Lib/idlelib/PyParse.py#l117

--
keywords: +patch
Added file: 
http://bugs.python.org/file35876/taleinat.20140706.IDLE_HyperParser_unicode_ids.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-22 Thread Martin v . Löwis

Martin v. Löwis added the comment:

The reason the Unicode consortium made this list (Other_ID_Start) is that they 
want to promise 100% backwards compatibility: if some programming language had 
been using UAX#31, changes to the Unicode database might break existing code. 
To avoid this, UAX#31 guarantees 100% stability.

The reason Python uses it is because it uses UAX#31, with the minimum number of 
modifications. We really shouldn't be making arbitrary changes to it. If we 
would e.g. say that we drop these four characters now, the next Unicode version 
might add more characters to Other_ID_Start, and then we would have to say that 
we include some, but not all, characters from Other_ID_Start.

So if IDLE wants to reimplement the XID_Start and XID_Continue properties, it 
should do it correctly. Note that the proposed patch only manages to replicate 
the ID_Start and ID_Continue properties. For the XID versions, see

http://www.unicode.org/reports/tr31/#NFKC_Modifications

Unfortunately, the specification doesn't explain exactly how these 
modifications are performed. For item 1, I think it is:

Characters which are in ID_Start (because they count as letters) but their NFKC 
decomposition does not start with an ID_Start character (because it starts with 
a modifier instead) are removed in XID_Start

For item 2, they unfortunately don't list all characters that get excluded. For 
the one example that they do give, the reason is clear: U+037A (GREEK 
YPOGEGRAMMENI, category Lm) decomposes to U+0020 (SPACE) U+0345 (COMBINING 
GREEK YPOGEGRAMMENI). Having a space in an identifier is clearly out of the 
question. I assume similar problems occur with certain Arabic presentation 
forms. I wish the consortium was more explicit as to what precise algorithms 
they use to derive their derived properties.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-22 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Tal: If you want to verify your is_id_char function, you could use the code

for i in range(65536):
c = chr(i)
c2 = 'a'+c
if is_id_char(c) != c2.isidentifier():
print('\\u%.4x'%i,is_id_char(c),c2.isidentifier())

Alternatively, you could use the strategy taken in that code for is_id_char 
itself:

def is_id_char(c):
  return ('a'+c).isidentifier()

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-22 Thread Tal Einat

Tal Einat added the comment:

 Note that the proposed patch only manages to replicate the
 ID_Start and ID_Continue properties.

Is this just because of the mishandling of the Other_ID_Start and 
Other_ID_Continue properties, or something else as well? I based my code on the 
definitions in:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

Are those actually wrong?


Note that my code uses category(normalize(char)[0]), so it's always making sure 
that the first character is valid. Actually, though, I now realize that it 
should check all of the values returned by normalize().

Regarding testing ('a'+something).isidentifier(), Terry already suggested 
something along those lines. I think I'll end up using something of the sort, 
to avoid adding additional complex Unicode-related code to maintain in the 
future.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-22 Thread Martin v . Löwis

Martin v. Löwis added the comment:

I think you are misinterpreting the grammar. Your code declares that U+00B2 
(SUPERSCRIPT TWO, ²) is an identifier character. Its category is No, so it is 
actually not. However, its normalization is U+0032 (DIGIT TWO, 2), which is an 
identifier character - but that doesn't make SUPERSCRIPT TWO a member of 
XID_Continue.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-21 Thread Tal Einat

Tal Einat added the comment:

Alright, so I'm going to use the equivalent of the following code, unless 
someone can tell me that something is wrong:


from keyword import iskeyword
from unicodedata import category, normalize

_ID_FIRST_CATEGORIES = {Lu, Ll, Lt, Lm, Lo, Nl,
Other_ID_Start}
_ID_CATEGORIES = _ID_FIRST_CATEGORIES | {Mn, Mc, Nd, Pc,
 Other_ID_Continue}

_ASCII_ID_CHARS = set(string.ascii_letters + string.digits + _)
_ID_KEYWORDS = {True, False, None}

def is_id_char(char):
return char in _ASCII_ID_CHARS or (
ord(char) = 128 and
category(normalize(char)[0]) in _ID_CATEGORIES
)

def is_identifier(id_candidate):
return id_candidate.isidentifier() and (
(not iskeyword(id_candidate)) or
id_candidate in _ID_KEYWORDS
)

 def _eat_identifier(str, limit, pos):
i = pos
while i  limit and is_id_char(str[pos - i]):
i -= 1
if i  pos and not is_identifier(str[i:pos]):
return 0
return pos - i

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-21 Thread Ezio Melotti

Ezio Melotti added the comment:

 _ID_FIRST_CATEGORIES = {Lu, Ll, Lt, Lm, Lo, Nl,
 Other_ID_Start}
 _ID_CATEGORIES = _ID_FIRST_CATEGORIES | {Mn, Mc, Nd, Pc,
  Other_ID_Continue}

Note that Other_ID_Start and Other_ID_Continue are not categories -- they 
are properties -- and that unicodedata.category() won't return them, so adding 
them to these set won't have any effect.  I don't think there's a way to check 
if chars have that property, but as I said in my previous message it's probably 
safe to ignore them (nothing will explode even in the unlikely case that those 
chars are used, right?).

 def is_id_char(char):
return char in _ASCII_ID_CHARS or (
ord(char) = 128 and

What's the reason for checking if the ord is = 128?

category(normalize(char)[0]) in _ID_CATEGORIES
)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-21 Thread Martin v . Löwis

Martin v. Löwis added the comment:

The Other_ID_Start property is defined in

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

It currently includes 4 characters.

However, I would propose that methods .isidstart and .isidcontinue get added to 
the str type if there is a need for them.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-21 Thread Tal Einat

Tal Einat added the comment:

 What's the reason for checking if the ord is = 128?

It's an optimization. Assuming the majority of characters will be ASCII, most 
non-identifier characters will fail this test, thus avoiding the more involved 
generic Unicode check.

 However, I would propose that methods .isidstart and
 .isidcontinue get added to the str type if there is a
 need for them.

I was surprised to find .isidentifier() is a method of the str type. Even if 
that is considered useful enough to be a method of str, I don't think the 
functions you suggest are generally useful enough to warrant being added as 
methods.

Then again, I'm no authority on decided what gets included as methods of base 
types. Regardless, I'd rather this patch go in without affecting str; adding 
these methods to str is a separate issue. If someone decides to pursue that, 
feel free to use this code and/or +nosy me.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-21 Thread Ezio Melotti

Ezio Melotti added the comment:

 It's an optimization. Assuming the majority of characters will be
 ASCII, most non-identifier characters will fail this test, thus
 avoiding the more involved generic Unicode check.

I don't know what kind of characters are usually received as input.  If things 
like (ASCII) spaces, parentheses, commas are common, then the optimization is 
probably OK.

@Martin
Do you know the reason why characters with the Other_ID_Start have been 
included in the first place, given that they are no longer considered valid 
identifiers and I can hardly think any situation where someone would need it?  
Could they be removed from 3.5 if that makes the code simpler?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-19 Thread Ezio Melotti

Ezio Melotti added the comment:

 I'm not sure what the Other_ID_Start property mentioned in [1] and
 [2] means, though. Can we get someone with more in-depth knowledge of
 unicode to help with this? 

See http://www.unicode.org/reports/tr31/#Backward_Compatibility.
Basically they were considered valid ID_Start characters in previous versions 
of Unicode, but they are no longer valid.  I think it's safe to leave them out 
(perhaps they could/should be removed from the Python parser too), but if you 
want to add them the list includes only 4 characters (there are 12 more for 
Other_ID_Continue).

 The real question is how to do this *fast*, since HyperParser does a
 *lot* of these checks. Do you think caching would be a good approach?

I think it would be enough to check explicitly for ASCII chars, since most of 
them will be ASCII anyway.  If they are not ASCII you can use 
unicodedata.category (or .isidentifier() if it does the right thing).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-15 Thread Tal Einat

Tal Einat added the comment:

It seems that the unicodedata module already supplies relevant functions which 
can be used for this. For example, we can replace char in 
self._id_first_chars with something like:

from unicodedata import normalize, category
norm_char = normalize(char)[0]
is_id_first_char = norm_char_first == '_' or category(norm_char_first) in 
{Lu, Ll, Lt, Lm, Lo, Nl}

I'm not sure what the Other_ID_Start property mentioned in [1] and [2] means, 
though. Can we get someone with more in-depth knowledge of unicode to help with 
this? 

The real question is how to do this *fast*, since HyperParser does a *lot* of 
these checks. Do you think caching would be a good approach?

See:
.. [1]: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
.. [2]: http://legacy.python.org/dev/peps/pep-3131/

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-15 Thread Tal Einat

Tal Einat added the comment:

Bah, I messed up the code sample in my previous message. It was supposed to be:

from unicodedata import normalize, category
norm_char_first = normalize(char)[0]
is_id_first_char = (
norm_char_first == '_' or
category(norm_char_first) in {Lu, Ll, Lt, Lm, Lo, Nl}
)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-15 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I checked for usage: _id(_first)_chars is only used in _eat_identifier, which 
is used in one place in get_expression. That is called once each in 
AutoComplete and CallTips. Both are seldom visible accept as requested (by 
waiting, for calltips). Calltips is only called on entry of '('. Is 
AutoComplete doing hidden checks?

_eat_identifier currently does a linear 'c in string' scan of the 2 char 
strings. I believe that both are long enough that O(1) 'c in set' checks would 
be faster. The sets could be augmented with latin1 id chars without becoming 
hugh or slowing the check (see below). This is a change we could make as soon 
as the test file and new failing tests are ready.

I just discovered, new in 3.x, str.isidentifier.
 '1'.isidentifier()
False
 'a'.isidentifier()
True
 '\u'
'쳌'
 '\u'.isidentifier()
True

This is, however, meant to be used in the forward direction. If 
s[pos].isidentifier(), check s[pos:end].identifier(), where end is 
progressively incremented until the check fails. For backwards checking, it 
could be used with a start char prefixed: ('a'+s[start:pos]).isidentifier(). To 
limit the cost, the start decrement could be 4 chars at a time, with 2 extra 
tests (binary search) on failure to find the actual start.

The 3.x versions of other isxyg functions could be useful: isalpha, isdecimal, 
isdigit, isnumeric. We just have to check their definitions against the two 
identifier class definitions.

What is slightly annoying is that in CPython 3.3+, all-ascii strings are marked 
as such but the info is not directly accessible without without ctypes. I 
believe all-latin-1 strings can be detected by comparing sys.getsizeof(s) to 
len(s), so we could augment the char sets to include the extra identifier chars 
in latin-1.

We could add a configuation option to assume all-ascii (or better, all-latin1 
code chars or not, and note that 'all latin1' will run faster but not recognize 
identifiers for the two features that use this.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-15 Thread Tal Einat

Tal Einat added the comment:

AutoComplete isn't doing hidden checks. My concern is that auto-completion 
happens automatically and the parsing is done synchronously, so if the parsing 
takes a significant amount of time it can cause a delay long enough to be 
noticeable by users. We should also consider the IDLE is being used on some 
relatively weak computers, such as Raspberry Pi. I have one for testing if 
needed!

Your suggestion of jumping backwards several characters at a time and calling 
isidentifier() seems problematic. For example, how would it deal with something 
like 2fast2furious? Recognizing fast2furious as the identifier candidate is 
incorrect, but that's what we'd get with a straightforward implementation of 
your suggestion. I can't think of a way to get this right without be able to 
check if each character is valid for identifiers.

I still think the category(normalize(char)[0]) in {...} approach could be just 
fine. I'd really like to get feedback from an expert on unicode. The experts 
index lists three people as experts on unicodedata; I've added them to the nosy 
for this issue.

Once we have an identifier candidate, the code currently just checks that the 
first char is valid and that the identifier isn't a keyword. We probably should 
use str.isidentifier() instead of just checking the first character.

--
nosy: +ezio.melotti, lemburg, loewis

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-14 Thread Terry J. Reedy

New submission from Terry J. Reedy:

idlelib.HyperParser.Hyperparser has these lines
_whitespace_chars =  \t\n\\
_id_chars = string.ascii_letters + string.digits + _
_id_first_chars = string.ascii_letters + _
used in _eat_identifier() and get_expression. At least the latter two should be 
turned into functions that access the unicode datebase. (Such functions, if not 
already present in the stdlib, should be. But adding something elsewhere would 
be another issue.)

--
messages: 220593
nosy: taleinat, terry.reedy
priority: normal
severity: normal
stage: test needed
status: open
title: Idle: make 3.x HyperParser work with non-ascii identifiers.
type: behavior
versions: Python 3.4, Python 3.5

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

2014-06-14 Thread Terry J. Reedy

Terry J. Reedy added the comment:

#21686 adds the test file that a new test would go in.

--
dependencies: +IDLE - Test hyperparser

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21765
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com