[issue4610] Unicode case mappings are incorrect

2013-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 24.06.2013 00:52, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> There has been a relatively recent discussion of case mappings under #12753
> (msg144836).
> 
> I personally agree with Martin: str.upper/lower should remain the way it is -
> a simplistic 1-to-1 mapping using UnicodeData.txt fields.  More sophisticated
> case mapping algorithms belong in a specialized library module, not in the
> Python core.
> 
> The behavior of .title() and .capitalize() is harder to defend, so if someone
> can point to a Python library (PyICU?) that gets it right, we can
> reference it in the documentation.

.title() and .capitalize() are 1-1 mappings as well. Python only supports
Simple Case Operations and does not support Full Case Operations
which require parsing context (SpecialCasing.txt).

ICU does provide support for both:
http://userguide.icu-project.org/transforms/casemappings

PyICU wraps ICU, but it is not clear to me how you'd access those
mappings (the package doesn't provide documentation on the API; instead
it just gives a description of how to map the C++ API to a Python one):
https://pypi.python.org/pypi/PyICU

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4610
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue4610] Unicode case mappings are incorrect

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

There has been a relatively recent discussion of case mappings under #12753 
(msg144836).

I personally agree with Martin: str.upper/lower should remain the way it is - a 
simplistic 1-to-1 mapping using UnicodeData.txt fields.  More sophisticated 
case mapping algorithms belong in a specialized library module, not in the
Python core.

The behavior of .title() and .capitalize() is harder to defend, so if someone 
can point to a Python library (PyICU?) that gets it right, we can reference 
it in the documentation.

--
versions: +Python 3.4 -Python 2.6, Python 3.0




[issue4610] Unicode case mappings are incorrect

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

It looks like at least the OP issue has been fixed in #12736:

>>> 'ß'.upper()
'SS'
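
For reference, a short sketch of what full (1-to-n) case mapping looks like
from Python code. It assumes Python 3.3 or later, where str.upper() applies
the SpecialCasing.txt expansions; the sample characters are chosen here for
illustration only.

```python
# Full (1-to-n) uppercase mappings, assuming Python 3.3 or later.
samples = {
    "\u00df": "SS",        # LATIN SMALL LETTER SHARP S
    "\ufb01": "FI",        # LATIN SMALL LIGATURE FI
    "\u0149": "\u02bcN",   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}

for ch, expected in samples.items():
    upper = ch.upper()
    # The result can be longer than the input, so the old invariant
    # len(s.upper()) == len(s) does not hold in general.
    print(f"U+{ord(ch):04X} -> {upper!r} (length {len(ch)} -> {len(upper)})")
    assert upper == expected
```

Code that relied on upper() preserving string length needs to account for
this when moving to a version with full mappings.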

--
resolution:  -> out of date
status: open -> closed
superseder:  -> Request for python casemapping functions to use full not simple 
casemaps per Unicode's recommendation




[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Jeff Senn

Jeff Senn s...@users.sourceforge.net added the comment:

> Feel free to upload it here. I'm fairly skeptical that it is
> possible to implement casing correctly in a locale-independent
> way.

Ok. I will try to find time to complete it enough to be readable.

Unicode (see sec 3.13) specifies the casing of unicode strings pretty 
completely -- i.e. it gives Default Casing rules to be used when no 
locale specific tailoring is available.  The only dependencies on 
locale for the special casing rules are for Turkish, Azeri, and 
Lithuanian.  And you only need to know that that is the language, no 
other details.  So I'm sure that a complete implementation is possible 
without resort to a lot of locale munging -- at least for .lower() 
.upper() and .title().
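
A minimal sketch of what "only needing to know the language" could look like
for the Turkish/Azeri dotted and dotless i. The helper names and the
`language` parameter are hypothetical and not part of Python's str API. The
sketch assumes Python 3 and deliberately ignores the Lithuanian rules and the
"I followed by combining dot above" sequences that SpecialCasing.txt also
covers.

```python
# Hypothetical helpers; names and the `language` parameter are illustrative.
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def upper_for_language(text, language=None):
    """Uppercase text, tailoring 'i' for Turkish ('tr') and Azeri ('az')."""
    if language in ("tr", "az"):
        # Under the tailoring, dotted small i maps to dotted capital I.
        # Dotless small i already maps to plain 'I' by default.
        text = text.replace("i", DOTTED_CAPITAL_I)
    return text.upper()


def lower_for_language(text, language=None):
    """Lowercase text, tailoring 'I' for Turkish ('tr') and Azeri ('az')."""
    if language in ("tr", "az"):
        # Dotted capital I maps to plain 'i'; plain 'I' maps to dotless i.
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


print(upper_for_language("istanbul"))         # default rules
print(upper_for_language("istanbul", "tr"))   # Turkish tailoring
print(lower_for_language("ISPARTA", "tr"))
```

Passing the language as an argument keeps the strings themselves free of any
locale state.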

.swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

However .capitalize() is a bit weird; and I'm not sure it isn't 
incorrectly implemented now:

It UPPERCASES the first character, rather than TITLECASING, which is 
probably wrong in the very few cases where it makes a difference:
e.g. (using Croatian ligatures)

>>> u'\u01c5amonjna'.title()
u'\u01c4amonjna'
>>> u'\u01c5amonjna'.capitalize()
u'\u01c5amonjna'

Capitalization is not precisely defined (by the Unicode standard) -- 
the current Python implementation doesn't even do what the docs say: 
makes the first character have upper case (it also lower-cases all 
other characters!). However, I might argue that a more useful 
implementation makes the first character have titlecase...
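
To make the uppercase/titlecase distinction concrete, here is a small sketch
assuming Python 3. The helper `capitalize_with_titlecase` is a hypothetical
name, not an existing method. It titlecases the first character and
lowercases the rest, which is the behaviour argued for above.

```python
def capitalize_with_titlecase(text):
    """Titlecase the first character and lowercase the rest (illustrative)."""
    return text[:1].title() + text[1:].lower()


dz_small = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON (a digraph)

# The digraph has three distinct case forms.
print(repr(dz_small.upper()))   # uppercase form, U+01C4
print(repr(dz_small.title()))   # titlecase form, U+01C5
print(repr(capitalize_with_titlecase("\u01c6amonjna")))
```

For the vast majority of characters the uppercase and titlecase forms are
identical, so the two approaches only differ for digraphs like this one.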

--




[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Jeff Senn

Jeff Senn s...@users.sourceforge.net added the comment:

Yikes! I just noticed that u''.title() is really broken! 

It doesn't really pay attention to word breaks -- 
only characters that have case.  
Therefore when there are (caseless)
combining characters in a word it's really broken e.g.

>>> u'n\u0303on\u0303e'.title()
u'N\u0303On\u0303E'

That is (where '~' is combining-tilde-over)
n~on~e -title-cases-to- N~On~E
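
One way to sidestep the problem is to decide word boundaries separately from
the "has case" test. The sketch below assumes Python 3 and uses plain
whitespace as the word delimiter, which is a simplification and not the
Unicode word-break algorithm; `title_by_whitespace` is a hypothetical helper.

```python
import re


def title_by_whitespace(text):
    """Titlecase the first character of each whitespace-delimited word."""
    def fix_word(match):
        word = match.group(0)
        # Only the first character is titlecased; everything after it,
        # including letters that follow combining marks, is lowercased.
        return word[:1].title() + word[1:].lower()

    return re.sub(r"\S+", fix_word, text)


print(ascii(title_by_whitespace("n\u0303on\u0303e")))
```

A word that begins with punctuation (an opening quote, for example) is left
without a titlecased letter by this sketch, so it is not a complete fix.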

--




[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Jeff Senn wrote:
> 
> Jeff Senn s...@users.sourceforge.net added the comment:
> 
> Yikes! I just noticed that u''.title() is really broken! 
> 
> It doesn't really pay attention to word breaks -- 
> only characters that have case.  
> Therefore when there are (caseless)
> combining characters in a word it's really broken e.g.
> 
> >>> u'n\u0303on\u0303e'.title()
> u'N\u0303On\u0303E'
> 
> That is (where '~' is combining-tilde-over)
> n~on~e -title-cases-to- N~On~E

Please have a look at http://bugs.python.org/issue6412 - that patch
addresses many casing issues, at least to the extent that we can
actually fix them without breaking code relying on:

len(s.upper()) == len(s)

for upper/lower/title.

If we add support for 1-n code point mappings, then we can only
enable this support by using an option to the casing methods (perhaps
not a bad idea: the parameter could be used to signal the locale
to assume).
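
As an illustration of the length invariant, the sketch below (assuming
Python 3.3 or later, where str.upper() may expand characters) keeps any
character whose uppercase form is longer than one code point unchanged. The
function name is hypothetical, and this only approximates the simple 1-1
mapping: a few characters have a single-code-point simple mapping that
differs from leaving them untouched.

```python
def upper_preserving_length(text):
    """Uppercase only where the result is a single code point (illustrative)."""
    result = []
    for ch in text:
        mapped = ch.upper()
        # Keep the original character when the mapping would expand it.
        result.append(mapped if len(mapped) == 1 else ch)
    return "".join(result)


word = "stra\u00dfe"
converted = upper_preserving_length(word)
print(converted, len(converted) == len(word))
```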

--




[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Jeff Senn wrote:
> However .capitalize() is a bit weird; and I'm not sure it isn't 
> incorrectly implemented now:
> 
> It UPPERCASES the first character, rather than TITLECASING, which is 
> probably wrong in the very few cases where it makes a difference:
> e.g. (using Croatian ligatures)
> 
> >>> u'\u01c5amonjna'.title()
> u'\u01c4amonjna'
> >>> u'\u01c5amonjna'.capitalize()
> u'\u01c5amonjna'
> 
> Capitalization is not precisely defined (by the Unicode standard) -- 
> the current Python implementation doesn't even do what the docs say: 
> makes the first character have upper case (it also lower-cases all 
> other characters!). However, I might argue that a more useful 
> implementation makes the first character have titlecase...

You don't have to worry about .capitalize() and .swapcase() :-)

Those methods are defined by their implementation and don't resemble
anything defined in Unicode.

I agree that they are, well, not that useful.

--




[issue4610] Unicode case mappings are incorrect

2009-10-14 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

> .swapcase() is just ...err... dumb^h^h^h^h questionably useful. 

FWIW, it appears that the original use case (as an Emacs macro) was to
correct blocks of text where touch typists had accidentally left the
Caps Lock key turned on:  tHE qUICK bROWN fOX jUMPED oVER tHE lAZY dOG.
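
For what it is worth, that use case is a one-liner (the variable name is
only for illustration):

```python
typed_with_caps_lock = "tHE qUICK bROWN fOX jUMPED oVER tHE lAZY dOG."

# swapcase() inverts the case of every cased character.
print(typed_with_caps_lock.swapcase())
```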

I agree with the rest of you that Python would be better-off without
swapcase().

--
nosy: +rhettinger




[issue4610] Unicode case mappings are incorrect

2009-10-13 Thread Jeff Senn

Jeff Senn s...@users.sourceforge.net added the comment:

Has there been any action on this? a PEP?

I disagree that using ICU is a good way to simply get proper
Unicode casing. (A heavy hammer for a small task...)

I agree locales are a different issue (and would prefer
optional arguments to the unicode object casing methods -- 
that could then be used within any future sort of locale object 
to handle correct casing -- but don't rely on such.)

Most of the special casing rules can be accomplished by 
a decomposition (or recursive decomposition) on the character
followed by casing the result -- so NO new table is necessary
-- only marking up the characters so implicated (there are
extra unused bits in the char type table that could be used 
for this purpose -- so no additional space needed there either).  

What remains are a tiny handful of cases that need to be handled
in code.
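
A rough sketch of the decomposition idea, assuming Python 3. The two sets
below are hypothetical stand-ins: one for the "marked" bit in the character
type table, the other for the handful of cases handled directly in code.
Only a few sample characters are included, so this is not a complete
implementation.

```python
import unicodedata

# Stand-in for characters flagged as "decompose before casing".
MARKED_FOR_DECOMPOSITION = {
    "\ufb01",  # LATIN SMALL LIGATURE FI
    "\u0149",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    "\u01f0",  # LATIN SMALL LETTER J WITH CARON
}

# Stand-in for the few cases that no decomposition can express.
HANDLED_IN_CODE = {"\u00df": "SS"}  # LATIN SMALL LETTER SHARP S


def upper_char(ch):
    """Uppercase one character via decomposition where it is marked."""
    if ch in HANDLED_IN_CODE:
        return HANDLED_IN_CODE[ch]
    if ch in MARKED_FOR_DECOMPOSITION:
        # Decompose first, then case each resulting code point.
        ch = unicodedata.normalize("NFKD", ch)
    return ch.upper()


def upper_text(text):
    return "".join(upper_char(ch) for ch in text)


print(upper_text("\ufb01sh"), upper_text("stra\u00dfe"))
```

On Python 3.3 and later, str.upper() already produces these results on its
own; the sketch only shows how they fall out of decomposition plus 1-1
mappings.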

I have a half finished implementation of this, in case anyone
is interested.

--
nosy: +senn




[issue4610] Unicode case mappings are incorrect

2009-10-13 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> I have a half finished implementation of this, in case anyone
> is interested.

Feel free to upload it here. I'm fairly skeptical that it is
possible to implement casing correctly in a locale-independent
way.

--




[issue4610] Unicode case mappings are incorrect

2008-12-20 Thread Alex Stapleton

Alex Stapleton al...@prol.etari.at added the comment:

I am trying to get a PEP together for this. Does anyone have any thoughts 
on how to handle comparison between unicode strings in a locale aware 
situation?

Should __lt__ and __gt__ be specified as ignoring locale? In which case do 
we need to add a new method for doing locale aware comparisons?

Should locale be a property of the string, an argument passed to 
upper/lower/isupper/islower/swapcase/capitalize/sort or global state 
(locale module...)?

Should doing a locale aware comparison of two strings with different 
locales throw an exception?

Should locales be represented as objects or just a string like en_GB?




[issue4610] Unicode case mappings are incorrect

2008-12-20 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> I am trying to get a PEP together for this. Does anyone have any thoughts 
> on how to handle comparison between unicode strings in a locale aware 
> situation?

Implementation-wise, or specification-wise? Implementation-wise, you can
either try to use the C library, or ICU. For portability, ICU is better;
for maintenance, the C library. Specification-wise: it should just
Do The Right Thing, and probably be exposed either through the locale
module, or through locale objects (in case you want to operate on
multiple different locales in a single program) - see other OO languages
on how they provide locales.

> Should __lt__ and __gt__ be specified as ignoring locale?

Yes.

> In which case do 
> we need to add a new method for doing locale aware comparisons?

No. Collation is a feature of the locale, not of the strings.

> Should locale be a property of the string, an argument passed to 
> upper/lower/isupper/islower/swapcase/capitalize/sort or global state 
> (locale module...)?

Either global state, or the object *that gets the strings passed to it*.

> Should doing a locale aware comparison of two strings with different 
> locales throw an exception?

Strings should not be tied into locales.

> Should locales be represented as objects or just a string like en_GB?

If you want to have multiple of them simultaneously, you need objects.
You still need to identify them by name.




[issue4610] Unicode case mappings are incorrect

2008-12-20 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

On 2008-12-20 17:19, Alex Stapleton wrote:
> Alex Stapleton al...@prol.etari.at added the comment:
> 
> I am trying to get a PEP together for this. Does anyone have any thoughts 
> on how to handle comparison between unicode strings in a locale aware 
> situation?

Some thoughts:

 * the Unicode implementation *must* stay locale independent

 * we should implement the Unicode collation algorithm
   (TR#10, http://unicode.org/reports/tr10/)

 * which collation to use should be a parameter of a function
   or object initializer and it should be possible to use
   multiple collations in the same application (without switching
   the locale)

 * the terms locale and collation should not be mixed;
   a (default) collation is a property of a locale and there can
   also be more than one collation per locale

The Unicode collation algorithm defines collation in terms of a
key function for each collation, so that already fits nicely with
the key function parameter of list.sort().

> Should __lt__ and __gt__ be specified as ignoring locale? In which case do 
> we need to add a new method for doing locale aware comparisons?

Unicode strings should not get any locale or collation specific
methods. Instead this feature should be implemented elsewhere
and the strings in question passed to this new function or
object.

> Should locale be a property of the string, an argument passed to 
> upper/lower/isupper/islower/swapcase/capitalize/sort or global state 
> (locale module...)?

No. See above.

> Should doing a locale aware comparison of two strings with different 
> locales throw an exception?

No, assigning locales to strings is not going to work and
we should not go down that road.

It's better to have locale aware functions for certain operations,
so that you can pass your Unicode strings to these functions
instead of binding additional context information to the Unicode
strings themselves.

> Should locales be represented as objects or just a string like en_GB?

I think the easiest way to get the collation algorithm implemented
is by using a similar scheme as for codecs: you pass a collation
name to a central function and get back a collation object that
implements the collation in form of a key method and a compare
method.
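
A minimal sketch of such a codec-style scheme, assuming Python 3. Everything
here is hypothetical: the `Collation` class, the `lookup_collation` function
and the registry names do not exist in the standard library, and the
accent-insensitive key is a toy stand-in, not the Unicode collation
algorithm from TR#10.

```python
import unicodedata


class Collation:
    """Collation object exposing a key method and a compare method."""

    def __init__(self, name, key_func):
        self.name = name
        self._key_func = key_func

    def key(self, text):
        """Return a sort key, usable as key= for list.sort() or sorted()."""
        return self._key_func(text)

    def compare(self, a, b):
        """Return -1, 0 or 1, in the style of a classic cmp function."""
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


def _accent_insensitive_key(text):
    # Toy key: casefold, decompose, drop combining marks; the original
    # string is kept as a tie-breaker so the ordering stays deterministic.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, text)


_REGISTRY = {
    "codepoint": Collation("codepoint", lambda text: text),
    "toy-accent-insensitive": Collation(
        "toy-accent-insensitive", _accent_insensitive_key
    ),
}


def lookup_collation(name):
    """Return the collation registered under name (KeyError if unknown)."""
    return _REGISTRY[name]


words = ["zebra", "\u00c9mile", "apple", "eclair"]
collation = lookup_collation("toy-accent-insensitive")
print(sorted(words, key=collation.key))
print(sorted(words, key=lookup_collation("codepoint").key))
```

Because the collation is an ordinary object looked up by name, several
collations can be used side by side without touching the process locale.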




[issue4610] Unicode case mappings are incorrect

2008-12-10 Thread Marc-Andre Lemburg

Marc-Andre Lemburg [EMAIL PROTECTED] added the comment:

Python uses the Unicode database for the mapping and this only contains
1-1 mappings. The special cases (mostly 1-2 mappings) are not included.

It would be nice to have them available as well, but I guess we'd have
to write them in code rather than invent a new mapping table for them.

Furthermore, there are a few cases like e.g. the Turkish i where case
mappings depend on external context such as the language the code point
is used in - those cases are difficult to get right.

We may need to extend the .lower()/.upper()/.title() methods with an
optional parameter that allows providing this extra context information
to the methods.

BTW: 'ß' is being phased out in German. The new writing rules encourage
using 'ss' or 'SS' instead (which is not entirely correct, since 'ß'
originated from 'sz' used some hundred or so years ago, but those are
just details ;-).

--
nosy: +lemburg




[issue4610] Unicode case mappings are incorrect

2008-12-10 Thread Alex Stapleton

Alex Stapleton [EMAIL PROTECTED] added the comment:

I agree with loewis that ICU is probably the best way to get this 
functionality into Python.

lemburg, yes, it seems that extending those methods would be required at 
the very least. I think we would probably also need to support ICU's 
collators.




[issue4610] Unicode case mappings are incorrect

2008-12-09 Thread Alex Stapleton

New submission from Alex Stapleton [EMAIL PROTECTED]:

Following a discussion on reddit it seems that the unicode case
conversion algorithms are not being followed.

$ python3.0
Python 3.0rc1 (r30rc1:66499, Oct 10 2008, 02:33:36) 
[GCC 4.0.1 (Apple Inc. build 5488)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x='ß'
>>> print(x, x.upper())
ß ß

This conversion is correct as defined in UnicodeData.txt however
http://unicode.org/Public/UNIDATA/SpecialCasing.txt defines a more
complete set of case conversions.

According to this file, 'ß'.upper() should be 'SS'. Presumably Python
simply isn't using this file to create its mapping database.
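
To show what the file contains, here is a small sketch of a parser for the
SpecialCasing.txt line format, assuming Python 3. It works on three sample
lines embedded in the code rather than downloading the file, and the
function name and return structure are illustrative choices.

```python
# Sample lines in the SpecialCasing.txt format:
# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
SAMPLE = """\
# Unconditional mappings
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
# Language-sensitive mapping
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
"""


def _to_text(field):
    """Convert space-separated hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    """Return (unconditional, conditional) mappings parsed from text."""
    unconditional = {}
    conditional = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if fields[-1] == "":
            fields.pop()  # the trailing ';' leaves an empty last field
        code, lower, title, upper = fields[:4]
        conditions = fields[4].split() if len(fields) > 4 else []
        entry = {
            "lower": _to_text(lower),
            "title": _to_text(title),
            "upper": _to_text(upper),
        }
        if conditions:
            conditional.append((_to_text(code), conditions, entry))
        else:
            unconditional[_to_text(code)] = entry
    return unconditional, conditional


unconditional, conditional = parse_special_casing(SAMPLE)
print(unconditional["\u00df"]["upper"])
print(conditional[0][1])
```

The conditional entries are kept apart because they apply only in a given
language or context.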

--
components: Unicode
messages: 77417
nosy: alexs
severity: normal
status: open
title: Unicode case mappings are incorrect
type: behavior
versions: Python 2.6, Python 3.0




[issue4610] Unicode case mappings are incorrect

2008-12-09 Thread Martin v. Löwis

Martin v. Löwis [EMAIL PROTECTED] added the comment:

I have known this problem for years, and decided not to act; I don't
consider it an important problem. Implementing it properly is
complicated by the fact that some of the case mappings are conditional
on the locale.

If you consider it important, please submit a patch.

I'd rather see efforts put into an integration of ICU, which should
solve this problem and many others with Python's locale support.

--
nosy: +loewis
