[issue14200] Idle shell crash on printing non-BMP unicode character

2012-03-06 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Sorry for mixing up the different problems; I noticed them all at once in the 
new python version, but I should have recognised that they belong to different 
domains.
I still might not understand the term "crash" properly - I just meant to 
distinguish between a single appropriate exception on an invalid operation 
(while the app stays alive and works on the next valid input), as is the case 
when running through python.exe, and, on the other hand, the immediate 
termination on encountering the invalid input, which happens with pythonw.exe.

Now I see that with pythonw a tk app terminates on the first exception (in 
general) in py 3.3 and also 3.2 (as opposed to py 2.7, where it just swallows 
the exception and stays alive, as one would probably expect).

Should this be reported as a separate issue, or is this what remains relevant 
in *this* report? (Sorry for the confusion.)

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14200
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14200] Idle shell crash on printing non-BMP unicode character

2012-03-05 Thread Vlastimil Brom

New submission from Vlastimil Brom vlastimil.b...@gmail.com:

Hi,
while testing python 3.3a1 a bit, especially the new string handling of non-BMP 
characters, I noticed a problem in Idle in this regard:

Python 3.3.0a1 (default, Mar  4 2012, 17:27:59) [MSC v.1500 32 bit (Intel)] on 
win32 ... 
[using win XPp SP3 Czech]

>>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> len(got_ahsa)
1
>>> got_ahsa.encode("unicode-escape")
b'\\U00010330'
>>> got_ahsa

[crash - idle shell window closes immediately without any visible error message 
or traceback]


I realised later, that tkinter probably won't be able to print wide-unicode 
characters anyway (according to 
http://bugs.python.org/issue12342 ), but Idle should probably just print the 
exception introduced there, e.g.
ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl
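
One possible way to avoid handing Tcl a character it cannot take is to escape such characters before display. This is only a minimal sketch, assuming Python 3.3 or later (where iterating a str yields whole code points); the helper name escape_non_bmp is hypothetical and is not part of IDLE or tkinter:

```python
def escape_non_bmp(text):
    """Return text with characters above U+FFFF replaced by \\UXXXXXXXX escapes."""
    parts = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Outside the Basic Multilingual Plane: show an escape instead.
            parts.append("\\U%08x" % ord(ch))
        else:
            parts.append(ch)
    return "".join(parts)
```

A display layer could call such a helper on any string before inserting it into a Text or Entry widget, so the user sees an escape instead of losing the shell.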

Regards
vbr

--
components: IDLE, Tkinter, Unicode
messages: 154944
nosy: ezio.melotti, vbr
priority: normal
severity: normal
status: open
title: Idle shell crash on printing non-BMP unicode character
versions: Python 3.3




[issue14200] Idle shell crash on printing non-BMP unicode character

2012-03-05 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Hi,
thanks for the pointer, after invoking idle using python.exe, I don't see the 
crash mentioned in the report:

Python 3.3.0a1 (default, Mar  4 2012, 17:27:59) [MSC v.1500 32 bit (Intel)] on 
win32
Type "copyright", "credits" or "license()" for more information.
>>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> len(got_ahsa)
1
>>> got_ahsa.encode("unicode-escape")
b'\\U00010330'
>>> got_ahsa

>>> print(got_ahsa)

>>> 


I just get an empty line as the answer, but no crash.

The console indeed contains the traceback with the error I expected:

   vbr



Microsoft Windows XP [Verze 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Python33>python.exe -m idlelib.idle
*** Internal Error: rpc.py:SocketIO.localcall()

 Object: stdout
 Method: <bound method PseudoFile.write of <idlelib.PyShell.PseudoFile object at
 0x01CDDB50>>
 Args: ('\U00010330',)

Traceback (most recent call last):
  File "C:\Python33\lib\idlelib\rpc.py", line 188, in localcall
    ret = method(*args, **kwargs)
  File "C:\Python33\lib\idlelib\PyShell.py", line 1244, in write
    self.shell.write(s, self.tags)
  File "C:\Python33\lib\idlelib\PyShell.py", line 1226, in write
    OutputWindow.write(self, s, tags, "iomark")
  File "C:\Python33\lib\idlelib\OutputWindow.py", line 40, in write
    self.text.insert(mark, s, tags)
  File "C:\Python33\lib\idlelib\Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\ColorDelegator.py", line 80, in insert
    self.delegate.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\PyShell.py", line 322, in insert
    UndoDelegator.insert(self, index, chars, tags)
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "C:\Python33\lib\idlelib\ColorDelegator.py", line 80, in insert
    self.delegate.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\WidgetRedirector.py", line 104, in __call__
    return self.tk_call(self.orig_and_operation + args)
ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl

--




[issue14200] Idle shell crash on printing non-BMP unicode character

2012-03-05 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd like to add some further observations to the mentioned issue;
it seems that the crash is indeed not specific to idle.
In a sample tkinter app, where I just display e.g. chr(66352) in an Entry 
widget, I also get the same immediate crash via pythonw.exe and the previously 
mentioned proper ValueError without a crash with python.exe.

I also tried to explicitly display a surrogate pair, which was used 
automatically until python 3.2; these can be used in tkinter in 3.3, but there 
are limitations and discrepancies:

 
>>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> def wide_char_to_surrog_pair(char):
        code_point = ord(char)
        if code_point <= 0xFFFF:
            return char
        else:
            high_surr = (code_point - 0x10000) // 0x400 + 0xD800
            low_surr = (code_point - 0x10000) % 0x400 + 0xDC00
            return chr(high_surr)+chr(low_surr)

>>> ahsa_surrog = wide_char_to_surrog_pair(got_ahsa)
>>> print(ahsa_surrog)
̰
>>> repr(ahsa_surrog)
'_ud800\x00udf30'
>>> ahsa_surrog
'Pud800 udf30'

[the space in the middle of the last item might be \x00, as it terminates the 
clipboard content; the rest was copied separately]

The printed square corresponds to the given character and can be used in 
other programs etc. (Whereas in py 3.2 the same value was used for repr and a 
direct display of the string in the interpreter, there are three different 
formats in py 3.3.)

I also noticed that a surrogate pair is not supported as input for 
unicodedata.name(...) anymore:
 
>>> import unicodedata
>>> unicodedata.name(ahsa_surrog)
Traceback (most recent call last):
  File "<pyshell#60>", line 1, in <module>
    unicodedata.name(ahsa_surrog)
TypeError: need a single Unicode character as parameter
 

(in 3.2 and probably others it returns the expected 'GOTHIC LETTER AHSA')
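
A more liberal lookup can be emulated in user code on Python 3.3 or later by joining the surrogate pair into one code point first. This is a minimal sketch; the helper names join_surrogates and liberal_name are hypothetical, and the round trip through UTF-16 raises UnicodeDecodeError for an unpaired surrogate:

```python
import unicodedata

def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into single code points."""
    # "surrogatepass" lets the lone surrogate code points through the encoder;
    # the decoder then reads each high/low pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

def liberal_name(text):
    """Like unicodedata.name(), but also accepts a surrogate pair."""
    return unicodedata.name(join_surrogates(text))
```

With this, liberal_name(ahsa_surrog) would return the name that py 3.2 gave for the same input.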

(I for my part would think that keeping somewhat liberal (but still 
unambiguous) input possibilities for unicodedata wouldn't hurt. Also, if 
tkinter is not going to support wide unicode natively any time soon, the output 
conversion using surrogates, which are also understandable for other programs, 
seems the most usable option in this regard.)

Hopefully, this is somehow relevant for the original issue -
I am somehow not sure, whether some parts would be better posted as separate 
issues, or whether this is the planned and expected behaviour anyway.

regards,
   vbr

--




[issue2636] Adding a new regex module (compatible with re)

2011-09-03 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Not that it matters in any way, but if the regex semantics has to be 
distinguished via non-standard custom flags, I would prefer even less wordy 
flags, possibly such that the short forms for the in-pattern flag setting would 
be one-letter (like all the other flags), and preferably based on plain English 
words, to get some mnemonic value (which I don't see in the numbered versions, 
which require one to keep track of the rather internal library versioning).
Unfortunately, it might be difficult to find suitable names, given the 
objections expressed against the ones already discussed. (For what it is worth, 
I thought e.g. of [t]raditional and [e]nhanced, but these also suffer from some 
of the mentioned disadvantages...)
vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Adding a new regex module (compatible with re)

2011-09-02 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd agree with Steven ( msg143377 ) and others that there probably shouldn't 
be a large library-specific set of new tags just for housekeeping purposes 
between re and regex. I would personally prefer that these tags also be 
settable in the pattern (?...), which would probably be problematic with 
versioned flags.

Although I am trying to take advantage of the new additions where applicable, I 
agree that there should be a possibility to use regex in an unreflected way, 
with the same behaviour as re (maybe except for the fixes of what will be 
agreed on to be a bug (enough)).
On the other hand, it seems to me that the enhancements/additions can be 
enabled at once, as a user consciously upgrading the regexes for the new 
library (or a new user not knowing re) can be supposed to know the new 
features and their implications. I guess it is mostly trivially possible to 
fix/disambiguate the problematic patterns, e.g. by escaping.

As for setting the new/old behaviour, would there be a possibility to 
distinguish it just by importing (possibly through some magic, without the need 
to duplicate the code?), 
import re_in_compat_mode as re
vs:
import re_with_all_the_new_features as re

Unfortunately, I have no idea whether this is possible or viable...
With this option, the (user) code update could be just a change of the 
imports instead of adding the flags in all relevant places (and taking them 
away as redundant, as the defaults evolve with the versions...).
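
The general import-aliasing pattern already works for swapping whole modules. The module names proposed above do not exist; the sketch below instead assumes the third-party regex module may or may not be installed and falls back to the standard library, so the rest of the code keeps using the name re unchanged:

```python
try:
    import regex as re  # third-party module, used only if it is installed
except ImportError:
    import re           # standard library fallback

# Code below this point is written once against the name "re".
print(re.sub("([xy])", "-\\1-", "abxc"))
```

This only selects an implementation per importing module; it does not by itself choose between old and new matching semantics inside one library.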

However, it is not clear how this aliasing would work out with regard to the 
transition; maybe the long differentiated module names could be kept and the 
meaning of "import re" would change, along with the previous warnings, in 
some future version.

just a few thoughts...
   vbr

--




[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-03 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for the comment on string.letters and the further reference.
Given that Mr. Barnett mentioned in his tracker for regex ( 
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6 ) that he only 
supports the LOCALE flag for compatibility with re, and given my zero 
knowledge of C, I suppose we will live with the status quo.
I guess, if there were a well-defined source of "letters" for the given 
locales, the implementation wouldn't necessarily have to be that complex (in 
the context of the regex code), but as there is probably no agreement in this 
respect (if string.letters is questionable), it becomes pointless.
After all, one can define a needed regex pattern manually, and mrab's regex 
library makes it much easier due to the support for unicode properties and 
others.
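
Defining such a pattern manually could look roughly like this with the standard re module on Python 3. The letter list and the names CZECH_LETTERS, CZECH_WORD and czech_words are illustrative assumptions, not something taken from re or regex:

```python
import re

# Czech letters with diacritics, lower case; upper case is derived below.
CZECH_LETTERS = "áčďéěíňóřšťúůýž"
CZECH_WORD = re.compile(
    "[A-Za-z%s%s]+" % (CZECH_LETTERS, CZECH_LETTERS.upper())
)

def czech_words(text):
    """Return runs of ASCII and Czech letters found in text."""
    return CZECH_WORD.findall(text)
```

The explicit character class makes the matched set visible in the source instead of depending on whatever the current locale reports.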

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11744
___



[issue11744] re.LOCALE doesn't reflect locale.setlocale(...)

2011-04-02 Thread Vlastimil Brom

New submission from Vlastimil Brom vlastimil.b...@gmail.com:

Hi,
I just noticed a behaviour of the re.LOCALE flag I can't understand; I first 
reported this to the new regex implementation, which, however, only mimics the 
standard lib re in this case:
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6
I also couldn't find anything relevant in the tracker, other than some older, 
already fixed issues; I'm sorry, if I missed something.
I thought, the search pattern (?L)\w would match any of the respective 
string.letters according to the current locale (and possibly additionally 
[0-9_]).

However, the locale doesn't seem to be reflected in an expected way.

>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0x10000))
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzƒ¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
 

>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0x10000))
>>> import string
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
 

It seems that the nearest letter set to the result of the re/regex LOCALE flags 
might be ascii or US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
 

however, there are some differences too, namely between zƒ and À
re (?L)\w : 
Czech
zŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿À
Greek
zƒ¢²³µ¸¹º¼¾¿À
string.letters -- US locale
zƒŠŒŽšœžŸªµºÀ
(as displayed in tkinter Idle shell)
(in either case, there are some items, one wouldn't consider usual word 
characters, cf. ¿)

I am not sure whether there are other issues involved (like some 
encoding/displaying peculiarities in Tkinter), but the re matching using the 
LOCALE flag doesn't reflect locale.setlocale(...) in a transparent way.

Is it supposed to work this way, and is there another possibility to get the 
locale-aware matching one might expect according to:
http://docs.python.org/library/re.html#re.LOCALE

"Make \w, \W, \b, \B, \s and \S dependent on the current locale."



using Python 2.7.1, 32 bit;  win 7 Home Premium 64-bit, Czech.

in Python 3.1.3 as well as 3.2 the result is the same (with the appropriately 
modified code): ...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
 

However, in Python 3, there is no comparison with string.letters available 
anymore.
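
A rough stand-in for the string.letters comparison on Python 3 is to decode all 256 byte values with the code page's codec and keep the alphabetic characters. This is only a sketch under that assumption: it relies on Python's own codec tables and Unicode properties rather than on the C library locale data that string.letters reflected, and the function name codepage_letters is hypothetical:

```python
def codepage_letters(encoding):
    """Return the alphabetic characters among the 256 byte values of a codec."""
    # Byte values the code page leaves undefined are skipped silently.
    decoded = bytes(range(256)).decode(encoding, errors="ignore")
    return "".join(ch for ch in decoded if ch.isalpha())
```

For example, codepage_letters("cp1250") and codepage_letters("cp1253") give the letter repertoires of the Czech and Greek Windows code pages used above.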

Regards,
Vlastimil Brom

--
components: Regular Expressions, Unicode
messages: 132826
nosy: vbr
priority: normal
severity: normal
status: open
title: re.LOCALE doesn't reflect locale.setlocale(...)
type: behavior
versions: Python 2.7, Python 3.1, Python 3.2




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Vlastimil Brom

New submission from Vlastimil Brom vlastimil.b...@gmail.com:

I just noticed an omission of some character names in the unicodedata module.
These are some CJK Ideographs:

龼 (0x9fbc) - 鿋 (0x9fcb)
 (CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

꜀ (0x2a700) - 뜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

띀 (0x2b740) - 렝 (0x2b81d)
 (CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

The names are probably to be generated - e.g. CJK UNIFIED IDEOGRAPH-2A700 ... 
etc.

(Tested with the recompiled unicodedata using unicode 6.0; with the py 2.7 
builtin module (unidata_version: '5.2.0') only the first two ranges are 
relevant, as CJK Unified Ideographs Extension D is an addition of Unicode 6.)

(Also there are the unprintable ASCII controls, surrogates and private use 
areas, where the missing names are probably ok.)


I tested with the following rather clumsy code:

# # # # # # # # # # # # # # # 
import unicodedata

# wide_unichr = custom unichr emulating unicode ranges beyond FFFF on narrow 
# python build (not shown here)
codepoints_missing_char_names = [[-2,-2],] # dummy
for i in xrange(0x10FFFF+1):
    # skip all "Other" categories (controls, surrogates, private use, unassigned)
    if (unicodedata.category(wide_unichr(i))[:1] != 'C' and 
            unicodedata.name(wide_unichr(i), u"??noname??") == u"??noname??"):
        # extend the current range, or start a new one
        if codepoints_missing_char_names[-1][1] == i-1:
            codepoints_missing_char_names[-1][1] = i
        else:
            codepoints_missing_char_names.append([i, i])

for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first), 
                                   wide_unichr(last), hex(last),)
# # # # # # # # # # # # # # # # # # # # # # # # # # 
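
On Python 3, where chr() covers the full range and no wide_unichr helper is needed, the same scan can be sketched roughly as follows. The function name unnamed_ranges and its start/stop parameters are illustrative assumptions, and the result depends on the Unicode version bundled with the interpreter:

```python
import unicodedata

def unnamed_ranges(start=0, stop=0x110000):
    """Return (first, last) runs of assigned code points that lack a name."""
    ranges = []
    for cp in range(start, stop):
        ch = chr(cp)
        # Skip the "Other" categories: controls, surrogates, private use, unassigned.
        if unicodedata.category(ch).startswith("C"):
            continue
        if unicodedata.name(ch, None) is None:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp          # extend the current run
            else:
                ranges.append([cp, cp])     # start a new run
    return [(first, last) for first, last in ranges]
```

Calling unnamed_ranges() without arguments walks all code points, which takes a moment; passing a block's boundaries limits the scan to that block.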

Unfortunately, I can't provide a fix, as unicodedata involves C code, where my 
knowledge is near zero.

vbr

--
messages: 121521
nosy: vbr
priority: normal
severity: normal
status: open
title: missing character names in unicodedata (CJK...)

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10459
___



[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Vlastimil Brom

Changes by Vlastimil Brom vlastimil.b...@gmail.com:


--
components: +Library (Lib), Unicode
type:  - behavior




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-13 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd have liked to suggest updating the underlying unicode data to the latest 
standard 6.0, but it turns out that it might be problematic with regard to 
cross-version compatibility;
according to the clarification in 
http://bugs.python.org/issue10400
the 3.x versions are going to be updated, while this is not allowed in the 2.x 
series.
I guess it would cause maintenance problems (as the needed properties are not 
available via unicodedata).
Anyway, while I'd like the recent unicode data to be supported (new characters, 
ranges, scripts, and corrected individual properties...),
I'm much happier that there is support for the 2 series in regex...
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-13 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thank you very much!
a quick test with my custom unicodedata with 6.0 on py 2.7 seems ok.
I hope, there won't be problems with cooperation of the more recent internal 
data with the original 5.2 database in python 2.x releases.

vbr

--




[issue10400] updating unicodedata to Unicode 6

2010-11-12 Thread Vlastimil Brom

New submission from Vlastimil Brom vlastimil.b...@gmail.com:

I'd like to suggest updating the unicodedata module according to the recent 
Unicode standard 6.0
http://www.unicode.org/versions/Unicode6.0.0/
I'm sorry to bother you in case this is planned automatically; I just wasn't 
able to find the respective information.
Would it be possible to apply such an update also for the upcoming python 
2.7.1, or are there some showstoppers/incompatibilities... with regard to the 
new unicode version?
  regards,
  vbr

--
components: Unicode
messages: 121070
nosy: vbr
priority: normal
severity: normal
status: open
title: updating unicodedata to Unicode 6
type: feature request
versions: Python 2.7, Python 3.1, Python 3.2, Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10400
___



[issue10400] updating unicodedata to Unicode 6

2010-11-12 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for the clarification;
I obviously looked in an inappropriate branch before.
Sorry for the noise...
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-11 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Maybe I am missing something, but the results in regex seem ok to me:
\A is treated like A in a character set; when the test string is changed to 
"A b c" or in the case insensitive search the A is matched.

[\A\s]\w doesn't match the starting a, as it is not followed by any word 
character:

>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'A b c')
... 
['A']
[]
[' b', ' c']
>>> for s in [r'\A\w', r'(?i)[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'a b c')
... 
['a']
[]
[' b', ' c']
>>> 

In the original re there seems to be a bug/limitation in this regard (\A and 
also \Z in character sets aren't supported in some combinations...)

vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-02 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

There seems to be a bug in the handling of numbered backreferences in sub() in
issue2636-20101102.zip
I believe it would be a fairly new regression, as it would have been noticed 
rather soon.
(tested on Python 2.7; winXP)

>>> re.sub("([xy])", "-\\1-", "abxc")
'ab-x-c'
>>> regex.sub("([xy])", "-\\1-", "abxc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 176, in sub
    return _compile(pattern, flags).sub(repl, string, count, pos, endpos)
  File "C:\Python27\lib\regex.py", line 375, in _compile_replacement
    compiled.extend(items)
TypeError: 'int' object is not iterable


vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-11-02 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Sorry for the noise, please forget my previous msg120215;
I somehow managed to keep an older version of _regex_core.py along with the new 
regex.py in the Lib directory, which are obviously incompatible.
After updating the files correctly, the mentioned examples work correctly.

vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I tried to give the 64-bit version a try, but I might have encountered a more 
general difficulty.
I tested this on Windows 7 Home Premium (Czech); the system is 64-bit (or I've 
hoped so so far :-), according to System info: x64-based PC
I installed
Python 2.7 Windows X86-64 installer
from http://www.python.org/download/
which runs ok, but the header in the python shell contains win32

Python 2.7 (r27:82525, Jul  4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.

Consequently, after copying the respective files from issue2636-20101009.zip
I get an import error:

>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python_64bit_27\lib\regex.py", line 253, in <module>
    from _regex_core import *
  File "C:\Python_64bit_27\lib\_regex_core.py", line 53, in <module>
    import _regex
ImportError: DLL load failed: %1 nenÝ platnß aplikace typu Win32.

 

(The last part of the message is in Czech with broken diacritics:
 "%1 is not a valid Win32 type application.")

Is there something I can do in this case? I'd think the installer would refuse 
to install 64-bit software on a 32-bit OS or 32-bit architecture, or am I 
missing something obvious in the naming peculiarities (x64, 64bit etc.)?
That being said, I probably don't need to use the 64-bit version of python; 
obviously, it isn't the wide unicode build mentioned earlier, hence
>>> len(u"\U00010333") # is still: 
2

And I currently don't have special memory requirements, which might be better 
addressed on a 64-bit system.
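
The "win32" in the header is the platform name and appears for both 32-bit and 64-bit Windows builds, as the two headers in this message show. A minimal sketch for checking the interpreter itself, using only standard library calls (the function name interpreter_bits is an illustrative assumption):

```python
import struct
import sys

def interpreter_bits():
    """Return the pointer size of the running interpreter in bits."""
    return struct.calcsize("P") * 8

# Platform name, interpreter bitness, and whether code points above
# U+FFFF are single characters (a "wide" build in Python 2 terms).
print(sys.platform, interpreter_bits(), sys.maxunicode > 0xFFFF)
```

An extension module such as _regex has to be compiled for the same pointer size as the interpreter that loads it.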

If there is something I can do to test regex in this environment, please let 
me know.
On the same machine the 32-bit version is ok:
Python 2.7 (r27:82525, Jul  4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex


regards
   vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Well, it seemed so to me too.
I happened to read the last post from Matthew, msg118243, in the sense that he 
made some updates which need testing on a 64 bit system (I am unsure whether 
hardware architecture, OS type, python build or something else was meant); but 
then it would have to be somehow separated as a new directory in 
issue2636-20101009.zip, which is not the case.

More generally, I was somehow confused about the win32 in the shell header in 
the mentioned install.
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-10-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Sorry for the noise,
it seems, I can go back to the 32-bit python for now then...
vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-21 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Well, of course, the surrogates probably shouldn't be handled separately in one 
module independently of the rest of the standard library. (I actually don't 
know of such a narrow implementation, although it is mentioned in the unicode 
guidelines 
http://unicode.org/reports/tr18/#Supplementary_Characters )

The main surprise on my part was due to the compile error rather than the empty 
match, as was the case with re; 
but now I see that it is a consequence of the newly introduced wide unicode 
notation; the matching behaviour changed consistently.

(For my part, the workarounds I found seem to be sufficient in the cases where 
I work with wide unicode; most likely I am not going to compile a wide unicode 
build on windows myself in the near future :-)
 vbr

--




[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-20 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I like the idea of the general new flag introducing the reasonable, backwards 
incompatible behaviour; one doesn't have to remember a list of non-standard 
flags to get these features.

While I recognise that the module probably can't work correctly with wide 
unicode characters on a narrow python build (py 2.7, win XP in this case), I 
noticed a difference to re in this regard (it might be based on the absence of 
the wide unicode literal in the latter).

re.findall(u"\\U00010337", u"a\U00010337bc")
[]
re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for "U00010337"); 
regex doesn't recognise the wide unicode as a surrogate pair either, but it also 
raises an error from the narrow unichr. Not sure whether/how it should be fixed, 
but the difference based on the i-flag seems unusual.

Of course it would be nice, if surrogate pairs were interpreted, but I can 
imagine, that it would open a whole can of worms, as this is not thoroughly 
supported in the builtin unicode either (len, indices, slicing).

I am trying to make wide unicode characters somehow usable in my app, mainly 
with hacks like an extended unichr
("\U"+hex(67)[2:].zfill(8)).decode("unicode-escape") 
or likewise for ord
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000
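
The arithmetic behind both hacks can be written out in both directions. This is a minimal sketch working on plain integers, so it behaves the same on any build; the function names to_surrogates and from_surrogates are illustrative:

```python
def to_surrogates(code_point):
    """Split a code point into UTF-16 code units (one or two integers)."""
    if code_point <= 0xFFFF:
        return (code_point,)
    offset = code_point - 0x10000
    # The high 10 bits go into the lead surrogate, the low 10 bits into the trail.
    return (0xD800 + offset // 0x400, 0xDC00 + offset % 0x400)

def from_surrogates(high, low):
    """Combine a lead and trail surrogate into the code point they encode."""
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
```

For U+10337 this gives the pair 0xD800, 0xDF37, and from_surrogates reverses it.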

Actually, using regex, one can work around some of these limitations of len, 
index or slice using a list form of the string containing surrogates

regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab\U00010337\U00010338\U00010339cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']
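
The same splitting idea can be sketched with the standard re module on Python 3, for strings that still carry surrogates as separate code units (for example data that came from a narrow build). Explicit ranges stand in for the \p{...} properties, which re does not support; the names CHAR_OR_PAIR and split_units are hypothetical:

```python
import re

# A lead surrogate followed by a trail surrogate, or any single character.
CHAR_OR_PAIR = re.compile(r"[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)

def split_units(text):
    """Split text into a list where each surrogate pair stays together."""
    return CHAR_OR_PAIR.findall(text)
```

len(), indexing and slicing on the resulting list then operate on whole characters rather than on halves of a pair.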

but apparently things like wide unicode literals or character sets (even 
extending of the shorthands like \w etc.) are much more complicated.

regards,
   vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Not that my opinion matters, but for what it's worth, I find it rather unusual 
to have to use special flags to get normal (for some definition of normal) 
behaviour, while the defaults stay buggy in some way (like ZEROWIDTH). I 
would think that backwards compatibility is not needed under these 
circumstances, in such probably marginal cases (or is setting global flags at 
the end of the pattern, or anywhere other than at its beginning, that 
frequent?). It seems that, with many new features and enhancements for 
previously impossible patterns, code using regular expressions in a more 
advanced way might benefit from a review of its patterns anyway (where the 
flags for historical behaviour could also be adjusted if really needed).

Anyway, thanks for the further improvements (although they broke my custom 
function, which misused the internal data of the regex module to get the 
unicode script property, currently unavailable via unicodedata :-).

Best regards,
   vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-09-12 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Just some further, rather marginal findings; differences between regex and re:

>>> regex.findall(r"[\B]", "aBc")
['B']
>>> re.findall(r"[\B]", "aBc")
[]

(Python 2.7 ... on win32; regex - issue2636-20100912.zip)
I believe, regex is more correct here, as uppercase \B doesn't have a special 
meaning within a set (unlike backspace \b), hence it should be treated as B, 
but I wanted to mention it as a difference, just in case it would matter.

I also noticed another case, where regex is more permissive:

>>> regex.findall(r"[\d-h]", "ab12c-h")
['1', '2', '-', 'h']
>>> re.findall(r"[\d-h]", "ab12c-h")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "re.pyc", line 177, in findall
  File "re.pyc", line 245, in _compile
error: bad character range
 

However, there might be an issue in negated sets, where the negation seems to 
apply to the first shorthand literal only; the rest is taken positively:

>>> regex.findall(r"[^\d-h]", "a^b12c-h")
['-', 'h']

cf. also a simplified pattern, where re seems to work correctly:

>>> regex.findall(r"[^\dh]", "a^b12c-h")
['h']
>>> re.findall(r"[^\dh]", "a^b12c-h")
['a', '^', 'b', 'c', '-']
 

or maybe, regardless of the order, in the presence of shorthand literals and 
normal characters in negated sets, these normal characters are matched 
positively:

>>> regex.findall(r"[^h\s\db]", "a^b 12c-h")
['b', 'h']
>>> re.findall(r"[^h\s\db]", "a^b 12c-h")
['a', '^', 'c', '-']
 

also related to character sets but possibly different: adding a 
(redundant) character that also belongs to the shorthand in a negated set 
seems to somehow confuse the parser:

>>> regex.findall(r"[^b\w]", "a b")
[]
>>> re.findall(r"[^b\w]", "a b")
[' ']

>>> regex.findall(r"[^b\S]", "a b")
[]
>>> re.findall(r"[^b\S]", "a b")
[' ']

>>> regex.findall(r"[^8\d]", "a 1b2")
[]
>>> re.findall(r"[^8\d]", "a 1b2")
['a', ' ', 'b']
 

I didn't find any relevant tracker issues, sorry if I missed some...
I initially wanted to provide test code additions, but as I am not sure about 
the intended output in all cases, I am leaving it in this form;
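
A possible starting point for such tests, without fixing the intended output in advance, is a consistency check: for any set body, every character of the input should be matched by exactly one of [body] and [^body]. A minimal sketch with the standard re module in Python 3 (the helper name is hypothetical; another engine's findall could be passed in its place):

```python
import re


def negation_is_consistent(set_body, text, findall=re.findall):
    """Check that [body] and [^body] split the characters of text cleanly."""
    positive = set(findall("[" + set_body + "]", text))
    negative = set(findall("[^" + set_body + "]", text))
    return positive.isdisjoint(negative) and (positive | negative) == set(text)


# The set bodies and sample strings from the cases above.
cases = [(r"\dh", "a^b12c-h"), (r"h\s\db", "a^b 12c-h"),
         (r"b\w", "a b"), (r"b\S", "a b"), (r"8\d", "a 1b2")]
for body, text in cases:
    print(body, negation_is_consistent(body, text))
```

With an engine that shows the behaviour reported above, the check would return False for these inputs.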

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-07-18 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for the update;
Just a small observation regarding some character ranges and ignorecase, 
probably irrelevant, but a difference to the current re anyway:

>>> zero2z = u"0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz"

>>> re.findall("(?i)[X-d]", zero2z)
[]

>>> regex.findall("(?i)[X-d]", zero2z)
[u'A', u'B', u'C', u'D', u'X', u'Y', u'Z', u'[', u'\\', u']', u'^', u'_', u'`', 
u'a', u'b', u'c', u'd', u'x', u'y', u'z']


>>> re.findall("(?i)[B-d]", zero2z)
[u'B', u'C', u'D', u'b', u'c', u'd']

>>> regex.findall("(?i)[B-d]", zero2z)
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', 
u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', 
u'[', u'\\', u']', u'^', u'_', u'`', u'a', u'b', u'c', u'd', u'e', u'f', u'g', 
u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', 
u'u', u'v', u'w', u'x', u'y', u'z']

It seems, that the re module is building the character set using a case 
insensitive alphabet in some way.

I guess, the behaviour of re is buggy here, while regex is ok (tested on py 
2.7, Win XPp).
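
The regex result corresponds to the reading that a character matches a case-insensitive range if any of its case variants falls inside the range. A brute-force sketch of that rule in Python 3 (a hypothetical helper for illustration only, not how either module is implemented):

```python
def in_range_ignorecase(ch, low, high):
    """True if ch, or its upper/lower case form, lies within low..high."""
    return any(low <= variant <= high
               for variant in (ch, ch.lower(), ch.upper()))


# All characters from "0" to "z", as in the zero2z sample string.
zero2z = "".join(chr(code) for code in range(ord("0"), ord("z") + 1))

print([ch for ch in zero2z if in_range_ignorecase(ch, "X", "d")])
```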

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2986] difflib.SequenceMatcher not matching long sequences

2010-07-07 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I guess I am not supposed to post to python-dev, not being a python 
developer; hopefully it is appropriate to add a comment here, based only on my 
current usage of (a modified) difflib.SequenceMatcher.
It seems that the mentions of text comparison in that thread, e.g. 
http://mail.python.org/pipermail/python-dev/2010-July/101515.html
etc., rather imply line-by-line comparison, and possibly character comparison 
of matched lines.
For me the direct character-wise comparison is more useful in most cases.
With the popular heuristics disabled the results look pretty good.
(The script only changes the background colour of the compared texts, based 
on the SequenceMatcher get_opcodes() output.)
Right now I only need to disable the popular check; currently I use a 
monkey-patched subclass of SequenceMatcher with an extended signature and a 
modified __chain_b function,
cf. http://mail.python.org/pipermail/python-list/2010-June/1247907.html

I would vote for extending the SequenceMatcher API to enable adjustments 
(leaving the default values as the current ones) - enable/disable popular 
check, set the thresholds for string length and popular frequency (and 
eventually other parameters, which might be added).
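
For reference, Python 3.2 and later expose such a switch as the autojunk keyword argument of SequenceMatcher. The sketch below only shows the call form and checks that the opcodes describe the change in both modes; the sample texts and the apply_opcodes helper are made up for illustration, and no particular opcode output is claimed.

```python
import difflib


def apply_opcodes(a, b, opcodes):
    """Rebuild b from a by replaying the opcodes of a SequenceMatcher."""
    parts = []
    for tag, i1, i2, j1, j2 in opcodes:
        # 'equal' copies from a; 'replace' and 'insert' take text from b;
        # 'delete' contributes nothing (its b slice is empty).
        parts.append(a[i1:i2] if tag == "equal" else b[j1:j2])
    return "".join(parts)


# Sample texts longer than 200 characters, so the heuristic can apply.
base = "the quick brown fox jumps over the lazy dog; " * 8
a = base[:100] + "X" + base[100:]
b = base[:250] + "Y" + base[250:]

for autojunk in (True, False):
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    print("autojunk=%s: %d non-equal opcodes" % (autojunk, len(changes)))
    assert apply_opcodes(a, b, matcher.get_opcodes()) == b
```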

Are there some restrictions on API changes in a library due to a moratorium - 
even if the default behaviour remains unchanged?
Otherwise, what might be the disadvantages of this approach?
If the current behaviour is considered appropriate for the original usecases, 
other uses would be also made possible/easier - only at the cost of learning 
the meaning of the added parameters - from the enhanced docs, of course.

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2986
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-07-06 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for the prompt fix!
It would indeed be nice to see this enhanced re module in the standard library, 
e.g. in 3.2, but I also really appreciate that multiple 2.x versions are 
supported (as my current main usage of this library involves a py2-only wx gui).
As for the usage statistics, I for one always downloaded the updates from here 
rather than from pypi, but maybe that is not the usual case.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-07-05 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I just noticed a somewhat strange behaviour in matching character sets or 
alternations which contain some more advanced unicode characters, if 
they are in the search pattern together with some simpler ones. The former seem 
to be ignored and not matched (the original re engine matches all of them); 
(win XPh SP3 Czech, Python 2.7; regex issue2636-20100414)

>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė

even stranger, if the pattern contains only these higher unicode characters, 
everything works ok: 
>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė


The characters in question are some accented latin letters (here in ascending 
codepoint order), but it can be other scripts as well.
>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']

The threshold isn't obvious to me; at first I thought that the characters 
represented as unicode escapes are problematic, whereas those with hexadecimal 
escapes are ok; however ē - u'\u0113' seems ok too.
(python 3.1 behaves identically:
>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2986] difflib.SequenceMatcher not matching long sequences

2010-04-19 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I just stumbled on some seemingly different unexpected behaviour of
difflib.SequenceMatcher, but it turns out that it may have the same cause, i.e. 
the popular heuristics.
I hopefully managed to replicate it on an illustrative sample text, as 
included in the attached file. (I also mentioned this issue on the python-list 
http://mail.python.org/pipermail/python-list/2010-April/1241951.html but as 
there were no replies, and given what I eventually found, this might be the 
more appropriate place.)
Both strings differ in a minimal way, each having one extra character
in a strategic position, which probably meets some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well for most cases - with most other
positions of the differing characters), the output of the
SequenceMatcher decides to delete a large part of the string in
between the differences and to insert the almost same text after that.
The attached code simply prints the results of the comparison with the
respective tags, and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, 3.1.1 on windows XPp SP3.
I didn't find any plausible mentions of such cases in the documentation, but 
after some searching I found several reports in the bug tracker mentioning the 
erroneous output of SequenceMatcher on longer repetitive sequences.

besides this
http://bugs.python.org/issue2986
e.g.
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074

In my case, disabling the popular heuristics as mentioned by John Machin in
http://bugs.python.org/issue1528074#msg29269

seems to have solved the problem; with a modified version of difflib containing:

if 0:   # disable popular heuristics
    if n >= 200 and len(indices) * 100 > n:
        populardict[elt] = 1
        del indices[:]

the comparison catches the differences in the test strings as expected - i.e. 
one character addition and deletion only. It is likely, that some other use 
cases for difflib may rely on the popular-heuristics but it also seems useful 
to have some control over this behaviour, which might not be appropriate in all 
cases.
(The issue seems to be the same in python 2.5, 2.6 and 3.1.)
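
To see which elements of a given text would be affected, the quoted condition can be approximated outside difflib. This is a simplified reading in Python 3 (the helper name is hypothetical): difflib tests the condition while it is still building its index, so its exact cut-off can differ by one occurrence.

```python
from collections import Counter


def popular_elements(seq):
    """Return the elements the quoted condition would roughly treat as popular.

    Simplified from `n >= 200 and len(indices) * 100 > n`: in a sequence of
    at least 200 items, an element counts as popular here if it makes up
    more than 1% of the sequence.
    """
    n = len(seq)
    if n < 200:
        return set()
    return {elt for elt, count in Counter(seq).items() if count * 100 > n}


text = "lorem ipsum dolor sit amet " * 10   # 270 characters
print(sorted(popular_elements(text)))
print(popular_elements("short text"))        # below the length threshold
```

In ordinary prose of a few hundred characters nearly every letter exceeds 1%, which is why character-wise comparison is hit so hard by this heuristic.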

regards,
   vbr

--
nosy: +vbr
Added file: http://bugs.python.org/file17001/difflib_test_inq.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2986
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-03-16 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I am not sure about the testsuite for this regex module, but it seems to me 
that many of the problems reported here probably don't apply to the current 
builtin re, as they are connected with the new features of regex.
After the suggestion in msg91462, I briefly checked the re testsuite and found 
it very comprehensive, given the featureset. Of course, most (or all?) re tests 
should apply to regex, but probably not vice versa.
vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-03-03 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I just noticed a corner case with the newly introduced grapheme matcher \X, if 
it is used in a character set:

>>> regex.findall("\X", "abc")
['a', 'b', 'c']
>>> regex.findall("[\X]", "abc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 218, in findall
  File "regex.pyc", line 1435, in _compile
  File "regex.pyc", line 2351, in optimise
  File "regex.pyc", line 2705, in optimise
  File "regex.pyc", line 2798, in optimise
  File "regex.pyc", line 2268, in __hash__
AttributeError: '_Sequence' object has no attribute '_key'

It obviously doesn't make much sense to use this universal literal in the 
character class (the same with . in its metacharacter role) and also 
http://www.regular-expressions.info/refunicode.html doesn't mention this 
possibility; but the error message might probably be more descriptive, or the 
pattern might match X or \ and \X (?)

I was originally thinking about the possibility of combining positive and 
negative character classes, where e.g. \X would be a kind of base; I am not 
aware of any re engine supporting this, but I eventually found the unicode 
guidelines for regular expressions, which also cover this:

http://unicode.org/reports/tr18/#Subtraction_and_Intersection

It is also a bit surprising that these are all included in
Basic Unicode Support: Level 1 (even with arbitrary unions, intersections, 
differences ...); this suggests that there is probably no implementation 
available (AFAIK) that conforms to this guideline, even on this basic level.

Among other features on this level, the section
http://unicode.org/reports/tr18/#Supplementary_Characters
seems useful, especially the handling of characters beyond \uFFFF, also in 
the form of surrogate pairs, as single characters.

This might be useful on the narrow python builds, but it is possible that 
there would be an incompatibility with the handling of these data in 
narrow python itself.

Just some suggestions or rather remarks, as you already implemented many 
advanced features and are also considering some different approaches ...:-)

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-03-03 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Actually I had that impression too, but I was mainly surprised that these 
requirements are on the lowest level of the unicode support. Anyway, maybe 
the relevance of these guidelines for the real libraries is lower than I 
expected.

Probably the simpler cases are adequately handled with lookarounds, e.g. 
(?:\w(?!\p{Greek}))+ and the complex examples like symmetric differences seem 
to be beyond the normal scope of re anyway.
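
A minimal sketch of the lookaround approach with the standard re module in Python 3. Since re has no \p{...} properties, plain ASCII sets stand in for them here (an assumption made only for illustration). A negative lookahead placed before a set emulates set subtraction:

```python
import re

# "[a-z] minus [aeiou]": the lookahead rejects vowels before the set consumes
# a character, so only lowercase ASCII consonants are matched.
consonant_runs = re.compile(r"(?:(?![aeiou])[a-z])+")

print(consonant_runs.findall("regular expressions"))
```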

Personally, I would find the surrogate handling more useful, but I see that it 
isn't actually a job for the re library, given that the narrow build of 
python doesn't support indexing, slicing or len of these characters either...

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-24 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks, it's indeed a very nice addition to the library...
Just a marginal remark: it seems that the script names also cover some non-BMP 
characters, whereas the unicode block ranges cover the BMP only.
http://www.unicode.org/Public/UNIDATA/Blocks.txt

Am I missing something more complex as to why the
10000.. - ..10FFFF; ranges weren't included in _BLOCKS?
Maybe building these ranges is expensive, in contrast to the rare uses of these 
properties?

(Not that I am able to reliably test it on my narrow python build on windows, 
but currently, obviously, e.g. \p{InGothic} gives undefined property name 
whereas \p{Gothic} is accepted.)

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-22 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Is the issue2636-20100222.zip archive supposed to be complete? I can't find 
the rst or html features files in it, nor, more importantly, the py and pyd 
files for the particular versions.

Anyway, I just skimmed through the regular-expressions.info documentation and 
found that most features which I missed in the builtin re version seem to be 
present in the regex module;
a few possibly notable exceptions being some unicode features:
http://www.regular-expressions.info/unicode.html 
support for unicode script properties might be needlessly complex (maybe unless 
http://bugs.python.org/issue6331 is implemented)

On the other hand \X for matching any single grapheme might be useful, 
according to the mentioned page, the currently working equivalent would be 
\P{M}\p{M}*
However, I am not sure about the compatibility concerns; it is possible, that 
the modifier characters as a part of graphemes might cause some discrepancies 
in the text indices etc. 
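
The \P{M}\p{M}* idea can be approximated in Python 3 with unicodedata alone. This is a rough sketch (hypothetical helper name), not a full grapheme cluster implementation; it also returns the start index of each cluster, so positions in the original string remain available:

```python
import unicodedata


def simple_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Returns (start_index, cluster) pairs. A mark with no preceding base
    character simply starts its own cluster.
    """
    clusters = []
    for index, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            start, cluster = clusters[-1]
            clusters[-1] = (start, cluster + ch)
        else:
            clusters.append((index, ch))
    return clusters


# "e" + COMBINING ACUTE ACCENT, then a plain "a"
print(simple_graphemes("e\u0301a"))
```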

A feature where I personally (currently) can't find a usecase is \G and 
continuing matches (but no doubt there would be some cases for this).

regards
   vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-22 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Wow, that's what can be called rapid development :-), thanks very much!
I didn't notice before that \G had been implemented already.
\X works fine for me; it also maintains the input string indices correctly.

We can use unicode character properties like \p{Letter} and unicode block 
properties like \p{inBasicLatin}; 
the script properties like \p{Latin} or \p{IsLatin} return undefined property 
name.
I guess this would require access to the respective information in 
unicodedata, where it isn't available now (there also seem to be many more 
scripts than those mentioned at regular-expressions.info,
cf.
http://www.unicode.org/Public/UNIDATA/Scripts.txt
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt (under # Script 
(sc))).
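
Since unicodedata exposes character names but not the Script property, a very rough stand-in is to take the first word of the character name. This is only a heuristic sketch in Python 3 (the helper name is hypothetical), and it is wrong for many characters, e.g. digits and punctuation, whose names do not start with a script name:

```python
import unicodedata


def rough_script(ch):
    """Guess a script label from the first word of the Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


# a, e with caron, Greek alpha, Gothic letter ahsa
for ch in "a\u011b\u03b1\U00010330":
    print("U+%04X" % ord(ch), rough_script(ch))
```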

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-18 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for fixing the argument positions;
unfortunately, it seems there might be some other problem that makes my code 
work differently than with the builtin re:
in character classes the ignorecase flag seems to be ignored somehow: 

>>> regex.findall(r"[ab]", "aB", regex.I)
['a']
>>> re.findall(r"[ab]", "aB", re.I)
['a', 'B']
 

(The same with the flag set in the pattern.)

Outside of the character class the case seems to be handled normally, or am I 
missing something?

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-17 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I just tested the fix for unicode tracebacks and found some possibly weird 
results (not sure how/whether it should be fixed, as these inputs are indeed 
rather artificial...).
(win XPp SP3 Czech, Python 2.6.4)

Using the cmd console, the output is fine (for the characters it can accept and 
display)

>>> regex.findall(ur"\p{InBasicLatinĚ}", u"aé")
Traceback (most recent call last):
...
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
regex.error: undefined property name 'InBasicLatinĚ'


(same result for other distorted property names containing e.g. 
ěščřžýáíéúůßäëiöüîô ...)

However, in Idle the output differs depending on the characters present:

>>> regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
yields the expected
...
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
error: undefined property name 'InBasicLatinÉ'

but

>>> regex.findall(ur"\p{InBasicLatinĚ}", u"ab c")

Traceback (most recent call last):
...
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
  File "C:\Python26\lib\regex.py", line 167, in __init__
    message = message.encode(sys.stdout.encoding)
  File "C:\Python26\lib\encodings\cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xcc' in position 
37: character maps to <undefined>
 

which might be surprising, as cp1250 should be able to encode Ě, maybe there 
is some intermediate ascii step?
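
One possible explanation, offered only as a guess: the message is already encoded to cp1250 once, and the resulting byte is then treated as a character again. In cp1250, Ě is the byte 0xCC; read back as Latin-1, that byte is Ì (U+00CC), which cp1250 cannot encode. A Python 3 sketch of that round trip (the Latin-1 step is an assumption, not something taken from the regex source):

```python
encoded_once = "Ě".encode("cp1250")
print(encoded_once)                      # the single byte 0xCC

# Hypothetical intermediate step: the byte is misread as a Latin-1 character.
misread = encoded_once.decode("latin-1")
print(hex(ord(misread)))

try:
    misread.encode("cp1250")
except UnicodeEncodeError as exc:
    print("second encode fails:", exc.reason)
```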

using the wxpython pyShell I get its specific encoding error:

>>> regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
Traceback (most recent call last):
...
  File "C:\Python26\lib\regex.py", line 1102, in _parse_escape
    return _parse_property(source, info, in_set, ch)
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
  File "C:\Python26\lib\regex.py", line 167, in __init__
    message = message.encode(sys.stdout.encoding)
AttributeError: PseudoFileOut instance has no attribute 'encoding'

(the same for \p{InBasicLatinĚ} etc.)


In python 3.1 in Idle, all of these exceptions are displayed correctly, also in 
other scripts or with special characters.

Maybe in python 2.x e.g. repr(...) of the unicode error messages could be used 
in order to avoid these problems, but I don't know, what the conventions are in 
these cases.


Another issue I found here (unrelated to tracebacks) is backslashes or 
punctuation (except the handled -_) in the property names, which just lead to 
failed matches and no exceptions about unknown property names:

>>> regex.findall(u"\p{InBasic.Latin}", u"ab c")
[]


I was also surprised by the added pos/endpos parameters, as I used flags as a 
non-keyword third parameter for the re functions in my code (probably my fault 
...)

re.findall(pattern, string, flags=0)

regex.findall(pattern, string, pos=None, endpos=None, flags=0, overlapped=False)

(is there a specific reason for this order, or could it be changed to maintain 
compatibility with the current re module?)
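
Passing flags by keyword sidesteps the positional difference. The sketch below uses the standard re module in Python 3; judging by the signature quoted above, the same call form should work for regex.findall as well.

```python
import re

# flags given by keyword, so the call does not depend on parameter order
print(re.findall(r"[ab]", "aB", flags=re.I))
```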

I hope, at least some of these remarks make some sense;
  thanks for the continued work on this module!

   vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-10 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Thanks for the quick update,
I confirm the fix for both issues;
just another finding (while testing the behaviour mentioned previously - 
msg91917)

The property name normalisation seems to be much more robust now; I just 
encountered an encoding error using a rather artificial input (in python 2.5, 
2.6):

>>> regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")
  File "C:\Python25\lib\regex.py", line 213, in findall
    return _compile(pattern, flags).findall(string, overlapped=overlapped)
  File "C:\Python25\lib\regex.py", line 599, in _compile
    parsed = _parse_pattern(source, info)
  File "C:\Python25\lib\regex.py", line 690, in _parse_pattern
    branches = [_parse_sequence(source, info)]
  File "C:\Python25\lib\regex.py", line 702, in _parse_sequence
    item = _parse_item(source, info)
  File "C:\Python25\lib\regex.py", line 710, in _parse_item
    element = _parse_element(source, info)
  File "C:\Python25\lib\regex.py", line 837, in _parse_element
    return _parse_escape(source, info, False)
  File "C:\Python25\lib\regex.py", line 1098, in _parse_escape
    return _parse_property(source, info, in_set, ch)
  File "C:\Python25\lib\regex.py", line 1240, in _parse_property
    raise error("undefined property name '%s'" % name)
error: <unprintable error object>
 

Not sure, how this would be fixed (i.e. whether the error message should be 
changed to unicode, if applicable).

Not surprisingly, in python 3.1, there is a correct message at the end:

regex.error: undefined property name 'UppercaseÄÄÄLetter'

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-09 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd like to add another issue I encountered with the latest version of regex - 
issue2636-20100204.zip

It seems, that there is an error in handling some quantifiers in python 2.5

on
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on 
win32

I get e.g.:

>>> regex.findall(ur"q*", u"qqwe")

Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    regex.findall(ur"q*", u"qqwe")
  File "C:\Python25\lib\regex.py", line 213, in findall
    return _compile(pattern, flags).findall(string, overlapped=overlapped)
  File "C:\Python25\lib\regex.py", line 633, in _compile
    p = _regex.compile(pattern, info.global_flags | info.local_flags, code, info.group_index, index_group)
RuntimeError: invalid RE code

There is the same error for other possibly infinite quantifiers like q+, 
q{0,} etc. with their non-greedy and possessive variants.

On python 2.6 and 3.1 all these patterns work without errors.

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2010-02-08 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Hi, thanks for the update! 
Just for the unlikely case that it hasn't been noticed so far: using python 
2.6.4 or 2.5.4 with the regexp build issue2636-20100204.zip
I am getting the following easy-to-fix error:

Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\regex.py", line 2003
    print "Header file written at %s\n" % os.path.abspath(header_file.name))
                                                                           ^
SyntaxError: invalid syntax

After removing the extra closing paren in regex.py, line 2003, everything seems 
ok.
   vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-24 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd like to add some detail to the previous msg91473

The current behaviour of the character properties looks a bit 
surprising sometimes:

 
>>> regex.findall(ur"\p{UppercaseLetter}", u"QW\p{UppercaseLetter}as")
[u'Q', u'W', u'U', u'L']
>>> regex.findall(ur"\p{Uppercase Letter}", u"QW\p{Uppercase Letter}as")
[u'\\p{Uppercase Letter}']
>>> regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")
[u'\\p{Uppercase\xc4\xc4\xc4Letter}']
>>> regex.findall(ur"\p{UppercaseQQQLetter}", u"QW\p{UppercaseQQQLetter}as")

Traceback (most recent call last):
  File "<pyshell#34>", line 1, in <module>
    regex.findall(ur"\p{UppercaseQQQLetter}", u"QW\p{UppercaseQQQLetter}as")
...
  File "C:\Python26\lib\regex.py", line 1178, in _parse_property
    raise error("undefined property name '%s'" % name)
error: undefined property name 'UppercaseQQQLetter'
 

i.e. potential property names consisting only of ascii letters 
(+ _, -) are looked up and either used or an error is raised;
other names (containing whitespace or non-ascii letters) aren't treated 
as a special expression, hence they either match their literal value 
or simply don't match (without errors).

Is this the intended behaviour? 
I am not sure whether it is defined somewhere, or whether there are some 
de-facto standards for this...
I guess spaces in the property names might be allowed (unless there 
are some implications for the parser...); otherwise the fallback 
handling of invalid property names as normal strings is probably the 
expected way.
vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-11 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

Sorry for the dumb question, which may also suggest that I'm 
unfortunately unable to contribute at this level (with zero knowledge 
of C and only a working knowledge of Python):
Where can I find the sources for tests etc., and how are they eventually 
to be submitted? Is some other account needed besides the one for 
bugs.python.org?

Anyway, the long character properties now work in the latest version 
issue2636-20090810#3.zip

In the mentioned overview 
http://www.regular-expressions.info/unicode.html
there is a statement about the property names: "You may omit the 
underscores or use hyphens or spaces instead." 
While I'm not sure that it is a good thing to have that many 
variations, they should probably all be handled in the same way.
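
One way to treat all these spellings alike is to normalize the name 
before looking it up. The following is a minimal sketch of that idea 
only; normalize_property_name is a hypothetical helper, not a function 
of the regex module.

```python
def normalize_property_name(name):
    """Drop underscores, hyphens and spaces so all variants compare equal."""
    return "".join(ch for ch in name if ch not in "_- ")

variants = [
    "Lowercase Letter",
    "Lowercase_Letter",
    "Lowercase-Letter",
    "LowercaseLetter",
]

# All four spellings collapse to a single lookup key.
print({normalize_property_name(v) for v in variants})
```

With such a step in front of the lookup, the parser would only ever see 
one canonical form of each long property name.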

Now, whitespace (and also non-ASCII characters) in the property 
name seems to confuse the parser: these pass silently (don't match 
anything) and don't throw an exception like "undefined property name".

cf.

>>> regex.findall(ur"\p{Dummy Property}", u"abcDEF")
[]
>>> regex.findall(ur"\p{DümmýPrópërtý}", u"abcDEF")
[]
>>> regex.findall(ur"\p{DummyProperty}", u"abcDEF")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 195, in findall
  File "regex.pyc", line 563, in _compile
  File "regex.pyc", line 642, in _parse_pattern
  File "regex.pyc", line 654, in _parse_sequence
  File "regex.pyc", line 662, in _parse_item
  File "regex.pyc", line 787, in _parse_element
  File "regex.pyc", line 1021, in _parse_escape
  File "regex.pyc", line 1159, in _parse_property
error: undefined property name 'DummyProperty'
>>> 

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-10 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

First, many thanks for this contribution; it's great, that the re 
module gets updated in that comprehensive way!

I'd like to report some issue with the current version 
(issue2636-20090804.zip).

Using an empty string as the search pattern ends up consuming system 
resources, and the function neither returns anything, raises an 
exception, nor crashes (within the several minutes I waited).
The current re engine simply returns the empty matches on all character 
boundaries in this case.

I use win XPh SP3, the behaviour is the same on python 2.5.4 and 2.6.2:
It should be reproducible with the following simple code:

>>> import re
>>> import regex
>>> re.findall("", "abcde")
['', '', '', '', '', '']
>>> regex.findall("", "abcde")
_
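
For reference, the behaviour of the standard re module mentioned above 
can be checked with a short script (a sketch written for Python 3, 
while the session above is Python 2):

```python
import re

text = "abcde"

# An empty pattern matches at every boundary: before each of the
# five characters and once more at the end of the string.
matches = re.findall("", text)
positions = [m.start() for m in re.finditer("", text)]

print(len(matches))
print(positions)
```

The number of empty matches is len(text) + 1, which is the result a 
replacement engine would be expected to reproduce.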

regards
vbr

--
nosy: +vbr

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-10 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I'd like to confirm that the error reported above is fixed in 
issue2636-20090810#2.zip
While testing the new features a bit, I noticed some irregularity in 
handling the Unicode character properties; 
I randomly tried some of those mentioned at 
http://www.regular-expressions.info/unicode.html using the simple 
findall like above.

It seems that only the short abbreviated forms of the properties are 
supported; the long variants, however, are handled in different ways.
Namely, property names containing whitespace or other non-letter 
characters cause a probably unexpected exception:

>>> regex.findall(ur"\p{Ll}", u"abcDEF")
[u'a', u'b', u'c']
# works ok
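
As a cross-check that does not depend on the regex module, the set of 
characters selected by \p{Ll} can be approximated with the standard 
unicodedata module. This is only a sketch for comparison (Python 3); 
find_lowercase_letters is a hypothetical helper name.

```python
import unicodedata

def find_lowercase_letters(text):
    """Return the characters whose general category is Ll."""
    return [ch for ch in text if unicodedata.category(ch) == "Ll"]

print(find_lowercase_letters("abcDEF"))
```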

\p{LowercaseLetter} isn't supported, but seems to be handled, as it 
throws "error: undefined property name" at the end of the traceback.

The result for \p{Lowercase Letter}, \p{Lowercase_Letter} and 
\p{Lowercase-Letter} is probably not expected; the traceback is:

>>> regex.findall(ur"\p{Lowercase_Letter}", u"abcDEF")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python25\lib\regex.py", line 194, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Python25\lib\regex.py", line 386, in _compile
    parsed = _parse_pattern(source, info)
  File "C:\Python25\lib\regex.py", line 465, in _parse_pattern
    branches = [_parse_sequence(source, info)]
  File "C:\Python25\lib\regex.py", line 477, in _parse_sequence
    item = _parse_item(source, info)
  File "C:\Python25\lib\regex.py", line 485, in _parse_item
    element = _parse_element(source, info)
  File "C:\Python25\lib\regex.py", line 610, in _parse_element
    return _parse_escape(source, info, False)
  File "C:\Python25\lib\regex.py", line 844, in _parse_escape
    return _parse_property(source, ch == "p", here, in_set)
  File "C:\Python25\lib\regex.py", line 983, in _parse_property
    if info.local_flags & IGNORECASE and not in_set:
NameError: global name 'info' is not defined
>>> 

Of course, arbitrary strings other than property names are handled 
identically.

Python 2.6.2 behaves the same as 2.5.4.

vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue5274] sys.exc_info()[1] - different handling from str() and unicode() - py 2.6

2009-04-16 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I just want to confirm that the reported issue is the same in python 
2.6.2.
Is it really the intended behaviour in python 2.6 (as opposed to 2.5)?

vbr

--
components: +Unicode

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5274
___



[issue4281] Idle - incorrectly displaying a character (Latin capital letter sharp s)

2009-03-14 Thread Vlastimil Brom

Vlastimil Brom vlastimil.b...@gmail.com added the comment:

I just wanted to confirm that the bug is neither in idle nor in tk, but 
somewhere in my installed fonts.
Now, while testing python 3.1a1, when I also have a font containing ẞ 
LATIN CAPITAL LETTER SHARP S (DejaVu), it's clearer.
Printing this character using the default font in idle, I get the wrong 
glyph mentioned in the report; however, this is corrected immediately 
after changing the font to DejaVu.
Some of the fonts on my system seem to shadow this newly added 
character with a wrong glyph (also preventing tk from finding a font 
really supporting it).

Sorry for the needless bug report.
   vbr

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4281
___



[issue4281] Idle - incorrectly displaying a character (Latin capital letter sharp s)

2008-11-08 Thread Vlastimil Brom

Vlastimil Brom [EMAIL PROTECTED] added the comment:

I can confirm that TCL displays the same character as Idle; hence it 
isn't a bug in Python (cf. the screenshot).
Unfortunately, I couldn't identify the font used here; I'm not able to 
modify and recompile Tk, as suggested, but I tried to check the 
possible serif fonts visually.
None of the fonts listed in Word is identical to the one used for 
capital sharp s in tcl (I created a simple app with Tkinter Labels 
showing the pairs of the characters in question using the potentially 
similar fonts; while some are really close, in all cases there are 
various differences in the glyphs).

In any case, I guess this isn't a problem in python which would have 
to be further examined; I have quite a lot of fonts installed, probably 
with some of them behaving in some non-standard ways.

Added file: http://bugs.python.org/file11968/capital-sharp-s-TCL-Idle.jpg

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue4281
___



[issue4281] Idle - incorrectly displaying a character (Latin capital letter sharp s)

2008-11-07 Thread Vlastimil Brom

New submission from Vlastimil Brom [EMAIL PROTECTED]:

While experimenting with the new unicodedata for version 5.1 (many 
thanks for it!) I discovered some strange behaviour of Idle with regard 
to a character not available in any font on my system, namely Latin 
capital letter sharp s - U+1E9E.
Cf. the following sessions:

Python 3.0rc2 (r30rc2:67141, Nov  7 2008, 11:43:46) [MSC v.1500 32 bit 
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
...
IDLE 3.0rc2  

>>> print("\N{LATIN CAPITAL LETTER SHARP S}")
ẞ
>>> print("\N{LATIN CAPITAL LETTER S WITH CEDILLA}")
Ş
>>> print("\N{PHAGS-PA LETTER KA}")
ꡀ
>>> print("\ufff0")
￰
>>> hex(ord("ẞ"))
'0x1e9e'
>>> hex(ord("Ş"))
'0x15e'
>>> 

Of course, the exact view cannot be copied, but basically I see very 
similar glyphs for the first two characters, while I had expected a 
square-sign or something for the first one; this is what I get with 
other surely unavailable glyph as well as a non existent character. See 
the attached screenshot.

However, the characters remain clearly distinguished, as can be seen 
e.g. after copying them as a parameter of ord(...).
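
The same check can be scripted so that it does not depend on how any 
font renders the glyphs. A minimal sketch for Python 3, using only the 
standard unicodedata module:

```python
import unicodedata

sharp_s = "\N{LATIN CAPITAL LETTER SHARP S}"
s_cedilla = "\N{LATIN CAPITAL LETTER S WITH CEDILLA}"

# Print code point and official name; the output is plain ASCII,
# so it is unaffected by missing or misleading glyphs.
for ch in (sharp_s, s_cedilla):
    print(hex(ord(ch)), unicodedata.name(ch))
```

If the two lines of output differ, the characters themselves are 
distinct, and any similarity on screen comes from the font.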

Python 2.6 behaves the same way:
===
Python 2.6 (r26:66721, Oct  2 2008, 11:35:03) [MSC v.1500 32 bit 
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
...
IDLE 2.6  
>>> print u"\N{LATIN CAPITAL LETTER SHARP S}"
ẞ
>>> 

...
==

Not that it is very important, but I found it a bit surprising. I'm 
using WinXPh SP3 Czech.

--
components: IDLE, Tkinter, Unicode
files: idle-capital-sharp-s.jpg
messages: 75613
nosy: vbr
severity: normal
status: open
title: Idle - incorrectly displaying a character (Latin capital letter sharp s)
versions: Python 2.6, Python 3.0
Added file: http://bugs.python.org/file11963/idle-capital-sharp-s.jpg

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue4281
___



[issue1688] Incorrectly displayed non ascii characters in prompt using input() - Python 3.0a2

2008-09-20 Thread Vlastimil Brom

Vlastimil Brom [EMAIL PROTECTED] added the comment:

While I am not sure about the status of this somewhat older issue, I 
just wanted to mention, that the behaviour remains the same in Python 
3.0rc1 (XPh SP3, Czech)

Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit 
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> input("ěšč: ")
─Ť┼í─Ź: řžý
'řžý'
>>> print("ěšč: ")
ěšč:


Is the patch above supposed to have been committed, or are there 
further difficulties?
(Not that it is a huge problem (for me), as applications dealing with 
non-ASCII text would probably use a GUI rather than relying on a 
console, but it's kind of surprising.)
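
The garbled prompt shown above is what one gets when the UTF-8 bytes of 
the prompt are displayed in code page 852. That the console used cp852 
is an assumption here (the report does not state the code page), but 
the following sketch reproduces the output exactly:

```python
prompt = "ěšč: "

# Encode as UTF-8, then decode the bytes as if they were cp852
# (assumed to be the console's OEM code page on this system).
as_seen_in_console = prompt.encode("utf-8").decode("cp852")

# Compare with the garbled prompt from the session above.
print(as_seen_in_console == "─Ť┼í─Ź: ")
```

This suggests that input() wrote the prompt as UTF-8 without regard to 
the console encoding, while print() converted it correctly.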

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1688
___



[issue3815] Python 3.0b3 - Idle doesn't start on win XPh

2008-09-09 Thread Vlastimil Brom

New submission from Vlastimil Brom [EMAIL PROTECTED]:

Using Python 3.0b3 on Windows XPH SP2 (installed from python-3.0b3.msi), 
Idle can't be started.
Using a Windows shortcut, only an error prompt is shown: "Subprocess 
Startup Error: IDLE's subprocess didn't make connection. Either IDLE 
can't start a subprocess or personal firewall is blocking the 
connection."
I'm aware of the warning about firewalls in IDLE, but the previous 3.0 
betas didn't have that issue with the same settings of the Windows 
firewall.

After directly calling:
C:\Python30\python.exe C:\Python30\Lib\idlelib\idle.py

The same error is thrown, but before it another exception is written to 
the console:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python30\lib\idlelib\run.py", line 76, in main
    sockthread.set_daemon(True)
AttributeError: 'Thread' object has no attribute 'set_daemon'

Regards,
   vbr

--
components: IDLE
messages: 72843
nosy: vbr
severity: normal
status: open
title: Python 3.0b3 - Idle doesn't start on win XPh
type: crash
versions: Python 3.0

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3815
___



[issue3815] Python 3.0b3 - Idle doesn't start on win XPh

2008-09-09 Thread Vlastimil Brom

Vlastimil Brom [EMAIL PROTECTED] added the comment:

Sorry for the noise, somehow my search in the bug tracker didn't show 
this report; after fixing the mentioned line in run.py everything works 
ok.
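
The corrected line is not shown in this message. As an illustration 
only: in current Python releases a thread is marked as a daemon through 
its daemon attribute, so a sketch of equivalent code (the serve 
function and the thread name are placeholders, not taken from run.py) 
would be:

```python
import threading

def serve():
    """Placeholder for the socket-handling work done in the thread."""
    pass

sockthread = threading.Thread(target=serve, name="SockThread")
# Mark the thread as a daemon before starting it.
sockthread.daemon = True
sockthread.start()
sockthread.join()

print(sockthread.daemon)
```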

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3815
___



[issue1688] Incorrectly displayed non ascii characters in prompt using input() - Python 3.0a2

2007-12-29 Thread Vlastimil Brom

Vlastimil Brom added the comment:

First, sorry about the delayed response; moreover, I fear that preparing 
a patch would be far beyond my programming competence; sorry about that.

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1688
__



[issue1110] Problems with the msi installer - python-3.0a1.msi

2007-12-07 Thread Vlastimil Brom

Vlastimil Brom added the comment:

I just installed python-3.0a2 and it works fine for me (Win XPh SP2 
Czech; python3 directory C:\Python30). So far I haven't found any 
problems other than those mentioned in the release notes.
Thank you very much for fixing this!

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1110
__



[issue1110] Problems with the msi installer - python-3.0a1.msi

2007-09-05 Thread Vlastimil Brom

New submission from Vlastimil Brom:

I encountered problems installing python 3.0 alpha 1 from the MSI 
installer supplied on the python download page (python-3.0a1.msi). 
If the advanced option of the installer (compile .py files to bytecode 
after installation) is checked, the following message is shown:
"There is a problem with this Windows installer package. A program run 
as part of the setup did not finish as expected ..."
If I don't choose the option to compile files, the installation 
finishes without any visible errors.

The result is the same in both cases, however. After calling python.exe 
it shows the version info etc. in the interactive prompt, but it 
doesn't respond in any way.
e.g.
>>> 1+1
object  : RuntimeError('lost sys.stdout',)
type    : RuntimeError
refcount: 4
address : 00A65BD0
lost sys.stderr
>>> 

Running of any .py file doesn't work either.

My system is Win XPh SP2 Czech (the same on Win XPp SP2 Czech).

Could the Czech Windows version/language setting/locale/timezone or 
whatever possibly be the problem (as there were some problems 
reported with the manual compilation on German or Polish Windows 
systems)?

Or am I missing something trivial?

Thanks,
 Vlastimil Brom

--
components: Windows
messages: 55665
nosy: vbr
severity: normal
status: open
title: Problems with the msi installer - python-3.0a1.msi
versions: Python 3.0

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1110
__



[issue1110] Problems with the msi installer - python-3.0a1.msi

2007-09-05 Thread Vlastimil Brom

Vlastimil Brom added the comment:

The path to the python executable on my system is:
C:\Python30\python.exe

The path to Program Files is C:\Program Files, but it doesn't matter 
in that case, I guess.

And yes, I use the console window (i.e. the cmd window in Windows); 
IDLE doesn't run either, nor do any other .py files (using python 3.0).

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1110
__