Re: sre is broken in SuSE 9.2

2005-02-13 Thread Martin v. Löwis
Serge Orlov wrote:
> Emphasis is mine. So how many libc implementations with
> non-unicode wide-character codes do we have in 2005?

Solaris has supported 2-byte wchar_t implementations for many
years, and so I believe did HP-UX and AIX.
ISO C99 defines a constant __STDC_ISO_10646__ which an
implementation can use to indicate that wchar_t uses
Unicode (aka ISO 10646) in all locales. Very few
implementations define this constant at this time, though.
Regards,
Martin
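
A small sketch related to the wchar_t width mentioned above: Python's ctypes module can report sizeof(wchar_t) for the C ABI the interpreter was built against. This only shows the width, not whether the C library encodes wide characters as Unicode in every locale; the helper name wchar_width_bytes is made up for this illustration.

```python
import ctypes


def wchar_width_bytes():
    """Return sizeof(wchar_t) for the C ABI this interpreter uses."""
    return ctypes.sizeof(ctypes.c_wchar)


# Typically 4 on Linux and 2 on Windows, but that is platform-dependent.
print("wchar_t is %d bytes wide" % wchar_width_bytes())
```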
--
http://mail.python.org/mailman/listinfo/python-list


Re: sre is broken in SuSE 9.2

2005-02-13 Thread Martin v. Löwis
Denis S. Otkidach wrote:
> You are right.  But the isalpha behavior looks strange to me anyway: why
> is the Cyrillic character '\u0430' recognized as alphabetic in the de_DE
> locale, but not in C?

In glibc, all real locales are based on
/usr/share/locale/i18n/locales/i18n, e.g. for de_DE through

  LC_CTYPE
  copy "i18n"

i18n includes U+0430 as a character, through

  lower /
  ...
  % TABLE 11 CYRILLIC/
     <U0430>..<U045F>;<U0461>..(2)..<U047F>;/

This makes U+0430 a letter in all locales including i18n
(unless locally overridden). This entire approach apparently
is based on ISO 14652, which, in section 4.3.3, introduces
the i18n LC_CTYPE category.
Why the C locale does not use i18n, I don't know. Most likely,
the intention is that the C locale works without any
additional data files - you should ask the glibc developers.
OTOH, there is a definition file POSIX for what appears
to be the POSIX locale.
I'd like to point out that this implementation is potentially
in violation of ISO 14652; annex A.2.2 says that the notion
of a POSIX locale is replaced with the i18n FDCC-set. So
accordingly, I would expect that i18n is used in POSIX as
well - see for yourself that it isn't in glibc 2.3.2.
Again, I suggest asking the glibc developers why
this is so.
Regards,
Martin
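
To see the locale dependence described above from Python, one option is to call the C library's iswalpha directly through ctypes. This is only a sketch under several assumptions: the C library can be found with ctypes.util.find_library, wint_t is an unsigned int (as on glibc), and the locale names passed in are installed on the system. The helper name iswalpha_in_locale is hypothetical.

```python
import ctypes
import ctypes.util
import locale


def iswalpha_in_locale(ch, locale_name):
    """Ask the C library whether `ch` is alphabetic under `locale_name`.

    Returns True or False, or None when the C library or the requested
    locale is not available on this system.
    """
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        iswalpha = libc.iswalpha
    except (OSError, AttributeError, TypeError):
        return None
    iswalpha.argtypes = [ctypes.c_uint]  # wint_t, assumed unsigned int
    iswalpha.restype = ctypes.c_int

    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None
    try:
        return bool(iswalpha(ord(ch)))
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


for name in ("C", "de_DE.UTF-8"):
    print(name, iswalpha_in_locale("\u0430", name))
```

The results depend on the C library and on which locales are installed, so no expected output is shown.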


Re: sre is broken in SuSE 9.2

2005-02-12 Thread Denis S. Otkidach
On Sat, 12 Feb 2005 09:42:41 +0100
Fredrik Lundh wrote:

> the relevant part for this thread is *locale-*.  if wctype depends on
> the locale, it cannot be used for generic build.  (custom interpreters
> are another thing, but they shouldn't be shipped as python).

You are right.  But the isalpha behavior looks strange to me anyway: why
is the Cyrillic character '\u0430' recognized as alphabetic in the de_DE
locale, but not in C?

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-12 Thread Martin v. Löwis
Serge Orlov wrote:
> To summarize the discussion: either it's a bug in glibc or there is an
> option to specify modern POSIX locale. POSIX locale consist of
> characters from the portable character set, unicode is certainly
> portable.

Yes, but U+00E4 is not in the portable character set. The portable
character set is defined here:
http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html
Regards,
Martin


Re: sre is broken in SuSE 9.2

2005-02-12 Thread Serge Orlov
Martin v. Löwis wrote:
> Serge Orlov wrote:
> To summarize the discussion: either it's a bug in glibc or there
> is an option to specify modern POSIX locale. POSIX locale consist of
> characters from the portable character set, unicode is certainly
> portable.
>
> Yes, but U+00E4 is not in the portable character set. The portable
> character set is defined here:
>
> http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html

Thanks for the link. They write (in 1997 or earlier?):

  The wide-character value for each member of the Portable
  Character Set will equal its value when used as the lone character
  in an integer character constant. Wide-character codes for other
  characters are locale- and *implementation-dependent*

Emphasis is mine. So how many libc implementations with
non-unicode wide-character codes do we have in 2005?
I'm really interested to know.

  Serge.




Re: sre is broken in SuSE 9.2

2005-02-12 Thread Serge Orlov
Fredrik Lundh wrote:
> Serge Orlov wrote:
>
> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe average
> C program shouldn't care about the current locale alphabet, so
> isalpha is a union of all supported characters in all alphabets
>
> btw, what does isalpha have to do with this example?

It has to do with this thread. u'\xe4'.isalpha() returns false in
Suse. It's in the same boat as \w

  Serge.




Re: sre is broken in SuSE 9.2

2005-02-12 Thread Serge Orlov
Fredrik Lundh wrote:
> Serge Orlov wrote:
>
> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe average
> C program shouldn't care about the current locale alphabet, so
> isalpha is a union of all supported characters in all alphabets
>
> nope.  isalpha() depends on the locale, as does all other ctype
> functions (this also applies to wctype, on some platforms).

I mean all supported characters in all alphabets [in the current
locale]. For example, in ru_RU.koi8-r isalpha should return
true for characters in the English and Russian alphabets; in
ru_RU.koi8-u, for characters in the English, Russian and Ukrainian
alphabets; in ru_RU.utf-8, for all alphabetic characters in Unicode
that the implementation supports. IMHO iswalpha in the POSIX
locale can return true for all alphabetic characters in Unicode
instead of being limited to the English alphabet.

  Serge.
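
The per-charset idea above can be illustrated without touching the C library at all. The sketch below uses Python's own koi8_r and koi8_u codecs plus str.isalpha (which consults Python's Unicode database, not the locale) to list which single-byte characters would count as alphabetic in each charset. It is an approximation of what a locale-aware isalpha might return, not a statement about any particular libc; alphabetic_chars is a name invented here.

```python
def alphabetic_chars(encoding):
    """Set of alphabetic characters representable as one byte in `encoding`."""
    text = bytes(range(256)).decode(encoding, errors="replace")
    return {ch for ch in text if ch.isalpha()}


koi8_r = alphabetic_chars("koi8_r")
koi8_u = alphabetic_chars("koi8_u")

# Letters that KOI8-U adds over KOI8-R (the Ukrainian-specific ones).
print(sorted(koi8_u - koi8_r))
```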



Re: sre is broken in SuSE 9.2

2005-02-12 Thread Fredrik Lundh
Serge Orlov wrote:

> The wide-character value for each member of the Portable
> Character Set will equal its value when used as the lone character
> in an integer character constant. Wide-character codes for other
> characters are locale- and *implementation-dependent*
>
> Emphasis is mine.

the relevant part for this thread is *locale-*.  if wctype depends on the
locale, it cannot be used for generic build.  (custom interpreters are
another thing, but they shouldn't be shipped as python).

/F 





Re: sre is broken in SuSE 9.2

2005-02-12 Thread Denis S. Otkidach
On Fri, 11 Feb 2005 18:49:53 +0100
Fredrik Lundh wrote:

> > >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> > [u'\xb5\xba\xe4\u0430']
> >
> > I can't find the strict definition of isalpha, but I believe average
> > C program shouldn't care about the current locale alphabet, so isalpha
> > is a union of all supported characters in all alphabets
>
> btw, what does isalpha have to do with this example?

The same problem is with isalpha.  In most distributions:
>>> for c in u'\xb5\xba\xe4\u0430': print c.isalpha(),
...
True True True True

And in SuSE 9.2:
>>> for c in u'\xb5\xba\xe4\u0430': print c.isalpha(),
...
False False False False
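
For comparison, here is a sketch of the same check written for a current Python 3 interpreter, where str.isalpha is driven by Python's own Unicode database. It prints the general category next to the isalpha result for the four characters from this thread; the names SAMPLE and describe are illustrative only.

```python
import unicodedata

SAMPLE = "\xb5\xba\xe4\u0430"  # the four characters tested in this thread


def describe(text):
    """Return (code point, general category, isalpha) for each character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), ch.isalpha())
        for ch in text
    ]


for row in describe(SAMPLE):
    print(row)
```

All four characters fall into a letter category (the exact subcategory of U+00BA has varied between Unicode versions), so isalpha is expected to be true for each of them regardless of the locale.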

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-11 Thread Denis S. Otkidach
On 10 Feb 2005 11:49:33 -0800
Serge Orlov wrote:

> This thread is about problems only with LANG=C or LANG=POSIX, it's not
> about other locales. Other locales are working as expected.

You are not right.  I have LANG=de_DE.UTF-8, and the Python test_re.py
doesn't pass.  $LANG doesn't matter if I don't call setlocale.
Fortunately, setting any non-C locale solves the problem for all (I
believe) Unicode characters:

>>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']
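
A rough Python 3 equivalent of the findall call above, as a sketch: for str patterns \w is Unicode-aware by default and does not consult the C locale, while the re.ASCII flag restricts it to [a-zA-Z0-9_]. The variable names are chosen for this example.

```python
import re

SAMPLE = "\xb5\xba\xe4\u0430"

# Unicode-aware matching (the default for str patterns in Python 3).
unicode_words = re.compile(r"\w+").findall(SAMPLE)

# ASCII-only matching, for contrast: none of the four characters qualify.
ascii_words = re.compile(r"\w+", re.ASCII).findall(SAMPLE)

print(unicode_words, ascii_words)
```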

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-11 Thread Fredrik Lundh
Serge Orlov wrote:

> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe average
> C program shouldn't care about the current locale alphabet, so isalpha
> is a union of all supported characters in all alphabets

btw, what does isalpha have to do with this example?

/F 





Re: sre is broken in SuSE 9.2

2005-02-11 Thread Fredrik Lundh
Serge Orlov wrote:

> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe average
> C program shouldn't care about the current locale alphabet, so isalpha
> is a union of all supported characters in all alphabets

nope.  isalpha() depends on the locale, as does all other ctype functions
(this also applies to wctype, on some platforms).

/F 





Re: sre is broken in SuSE 9.2

2005-02-10 Thread Serge Orlov
Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing related.
> What is the reason for broken behavior?  Incorrect configure options?

I can get the same results on RedHat's python 2.2.3 if I pass the re.L
option; it looks like this option is implicitly set on SuSE.

  Serge



Re: sre is broken in SuSE 9.2

2005-02-10 Thread Denis S. Otkidach
On Thu, 10 Feb 2005 13:00:42 +0300
Denis S. Otkidach wrote:

> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing
> related. What is the reason for broken behavior?  Incorrect configure
> options?

Just a bit more information. test_re.py fails in SuSE 9.2 with the
following errors:

<snip>
Running re_tests test suite
=== Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
=== Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4')
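
The two failing cases can be reproduced in isolation. The sketch below reruns the same patterns against u'\xc4' on a Python 3 interpreter, where both are expected to match; CASES and run_cases are names invented for this example and are not part of test_re.py.

```python
import re

# (pattern, subject) pairs taken from the failure report above.
CASES = [
    (r"(?u)\b.\b", "\xc4"),
    (r"(?u)\w", "\xc4"),
]


def run_cases(cases):
    """Return the matched text (or None) for each (pattern, subject) pair."""
    results = []
    for pattern, subject in cases:
        match = re.search(pattern, subject)
        results.append(match.group(0) if match else None)
    return results


print(run_cases(CASES))
```

A None in the result list would indicate the same kind of failure that test_re.py reported on SuSE 9.2.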

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-10 Thread Denis S. Otkidach
On 10 Feb 2005 03:59:51 -0800
Serge Orlov wrote:

> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
> [...]
> I can get the same results on RedHat's python 2.2.3 if I pass re.L
> option, it looks like this option is implicitly set in Suse.

Looks like you are right:

>>> import re
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
>>> from locale import *
>>> setlocale(LC_ALL, 'de_DE')
'de_DE'
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
<_sre.SRE_Match object at 0x40375560>

But I see nothing related to an implicit re.L option in their patches, and
the sources themselves are the same as on other platforms.  I'd prefer
to find the source of the problem.

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-10 Thread Daniel Dittmar
Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:

I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
RedHat), check sys.maxunicode.

This is not an explanation, but perhaps a hint where to look.

Daniel
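
The sys.maxunicode check suggested above can be sketched as follows. On the Python 2 builds discussed in this thread the value is 0xFFFF for a UCS2 (narrow) build and 0x10FFFF for a UCS4 (wide) build; Python 3.3 and later always report 0x10FFFF. The helper name unicode_build_width is hypothetical.

```python
import sys


def unicode_build_width(maxunicode=sys.maxunicode):
    """Classify a build as 'UCS2' (narrow) or 'UCS4' (wide)."""
    return "UCS4" if maxunicode > 0xFFFF else "UCS2"


print(hex(sys.maxunicode), unicode_build_width())
```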


Re: sre is broken in SuSE 9.2

2005-02-10 Thread Denis S. Otkidach
On Thu, 10 Feb 2005 16:23:09 +0100
Daniel Dittmar wrote:

> Denis S. Otkidach wrote:
>
> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
>
> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
> RedHat), check sys.maxunicode.
>
> This is not an explanation, but perhaps a hint where to look.

Yes, it uses UCS4.  But the Debian build with UCS4 works fine, so this is
not the problem.  Could the --with-wctype-functions configure option be
the source of the problem?

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-10 Thread Fredrik Lundh
Denis S. Otkidach wrote:

> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
>
> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
> RedHat), check sys.maxunicode.
>
> This is not an explanation, but perhaps a hint where to look.
>
> Yes, it uses UCS4.  But debian build with UCS4 works fine, so this is
> not a problem.  Can --with-wctype-functions configure option be the
> source of problem?

yes.

that option disables Python's own Unicode database, and relies on the C
library's wctype.h (iswalpha, etc) to behave properly for Unicode
characters.  this isn't true for all environments.

is this an official SuSE release?  do they often release stuff that hasn't
been tested at all?

/F 
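
One way to check whether an interpreter was configured with this option is to look at the recorded configure arguments. This is a sketch that assumes a POSIX build that records CONFIG_ARGS and an interpreter new enough to ship the sysconfig module (the Python 2.3 builds in this thread would need distutils.sysconfig instead); built_with_wctype_functions is a made-up helper name.

```python
import sysconfig


def built_with_wctype_functions(config_args=None):
    """Report whether --with-wctype-functions appears in the configure args.

    When `config_args` is None, the arguments recorded for the running
    interpreter are used (an empty string if none were recorded).
    """
    if config_args is None:
        config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return "--with-wctype-functions" in config_args


print(built_with_wctype_functions())
```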





Re: sre is broken in SuSE 9.2

2005-02-10 Thread Serge Orlov
Denis S. Otkidach wrote:
> Serge Orlov wrote:
>
> > > On all platforms \w matches all unicode letters when used with
> > > flag re.UNICODE, but this doesn't work on SuSE 9.2:
> [...]
> > I can get the same results on RedHat's python 2.2.3 if I pass re.L
> > option, it looks like this option is implicitly set in Suse.
>
> Looks like you are right:
>
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>> from locale import *
> >>> setlocale(LC_ALL, 'de_DE')
> 'de_DE'
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> <_sre.SRE_Match object at 0x40375560>
>
> But I see nothing related to implicit re.L option in their patches
> and the sources themselves are the same as on other platforms.  I'd
> prefer to find the source of problem.

I found that

print u'\xc4'.isalpha()
import locale
print locale.getlocale()

produces different results on Suse (python 2.3.3)

False
(None, None)


and RedHat (python 2.2.3)

1
(None, None)

  Serge.



Re: sre is broken in SuSE 9.2

2005-02-10 Thread Denis S. Otkidach
On Thu, 10 Feb 2005 17:46:06 +0100
Fredrik Lundh wrote:

> > Can --with-wctype-functions configure option be the
> > source of problem?
>
> yes.
>
> that option disables Python's own Unicode database, and relies on the C
> library's wctype.h (iswalpha, etc) to behave properly for Unicode
> characters.  this isn't true for all environments.
>
> is this an official SuSE release?  do they often release stuff that hasn't
> been tested at all?

Yes, it's an official release:
# rpm -qi python
Name        : python                        Relocations: (not relocatable)
Version     : 2.3.4                         Vendor: SUSE LINUX AG, Nuernberg, Germany
Release     : 3                             Build Date: Tue Oct  5 02:28:25 2004
Install date: Fri Jan 28 13:53:49 2005      Build Host: gambey.suse.de
Group       : Development/Languages/Python  Source RPM: python-2.3.4-3.src.rpm
Size        : 15108594                      License: Artistic License, Other License(s), see package
Signature   : DSA/SHA1, Tue Oct  5 02:42:38 2004, Key ID a84edae89c800aca
Packager    : http://www.suse.de/feedback
URL         : http://www.python.org/
Summary     : Python Interpreter
<snip>

BTW, where have they found something with Artistic License in Python?

-- 
Denis S. Otkidach
http://www.python.ru/  [ru]


Re: sre is broken in SuSE 9.2

2005-02-10 Thread Serge Orlov
Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing
> related. What is the reason for broken behavior?  Incorrect
> configure options?

To summarize the discussion: either it's a bug in glibc or there is an
option to specify a modern POSIX locale. The POSIX locale consists of
characters from the portable character set, and Unicode is certainly
portable.

  Serge.



Re: sre is broken in SuSE 9.2

2005-02-10 Thread Peter Maas
Serge Orlov wrote:
> Denis S. Otkidach wrote:
> To summarize the discussion: either it's a bug in glibc or there is an
> option to specify modern POSIX locale. POSIX locale consist of
> characters from the portable character set, unicode is certainly
> portable.

What about the environment variable LANG? I have SuSE 9.1 and
LANG = de_DE.UTF-8. Your example is running well on my computer.
--
---
Peter Maas,  M+R Infosysteme,  D-52070 Aachen,  Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
---


Re: sre is broken in SuSE 9.2

2005-02-10 Thread Fredrik Lundh
Peter Maas wrote:

> To summarize the discussion: either it's a bug in glibc or there is an
> option to specify modern POSIX locale. POSIX locale consist of
> characters from the portable character set, unicode is certainly
> portable.
>
> What about the environment variable LANG? I have SuSE 9.1 and
> LANG = de_DE.UTF-8. Your example is running well on my computer.

Python's Unicode subsystem shouldn't depend on the system's LANG
setting.

/F 





Re: sre is broken in SuSE 9.2

2005-02-10 Thread Serge Orlov
Peter Maas wrote:
> Serge Orlov wrote:
> > Denis S. Otkidach wrote:
> > To summarize the discussion: either it's a bug in glibc or there is
> > an option to specify modern POSIX locale. POSIX locale consist of
> > characters from the portable character set, unicode is certainly
> > portable.
>
> What about the environment variable LANG? I have SuSE 9.1 and
> LANG = de_DE.UTF-8. Your example is running well on my computer.

This thread is about problems only with LANG=C or LANG=POSIX, it's not
about other locales. Other locales are working as expected.

  Serge.
