[issue4971] Incorrect title case

2009-04-25 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

In r71894, makeunicodedata.py was fixed to correctly encode titlecase in
the unicodectype database (see issue5828).

In r71947, r71948, r71949, r71950, this issue is fixed by not having
titlecase fall back to uppercase at run-time anymore.

--
resolution:  -> fixed
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4971
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue4971] Incorrect title case

2009-01-17 Thread Matthew Barnett

New submission from Matthew Barnett pyt...@mrabarnett.plus.com:

I've found that the following 4 Unicode characters/codepoints don't
behave as I'd expect: Dž (U+01C5), Lj (U+01C8), Nj (U+01CB), Dz (U+01F2).

For example, u'\u01C5'.istitle() returns True and
unicodedata.category(u'\u01C5') returns 'Lt', but u'\u01C5'.title()
returns u'\u01C4' (DŽ), which is the uppercase equivalent.
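
The checks above can be repeated on a current interpreter with a short script. This is only a sketch, written for Python 3 (so without the u'' prefix); the DIGRAPHS list and the report() helper are names invented for this illustration. On an interpreter that still has the bug, the title column would show the uppercase code point rather than the character itself.

```python
import unicodedata

# The four titlecase digraphs named in the report.
DIGRAPHS = ["\u01C5", "\u01C8", "\u01CB", "\u01F2"]


def report(ch):
    """Return (category, istitle, title, upper) for a single character."""
    return (unicodedata.category(ch), ch.istitle(), ch.title(), ch.upper())


for ch in DIGRAPHS:
    cat, is_title, title, upper = report(ch)
    # Print code points rather than glyphs so the result is unambiguous.
    print(
        f"U+{ord(ch):04X} category={cat} istitle={is_title} "
        f"title=U+{ord(title):04X} upper=U+{ord(upper):04X}"
    )
```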

I believe that these 4 codepoints are the only ones where the titlecase
differs from uppercase.

I thought it might be a mistake in the Unicode database. However John
Machin says:

Doesn't look like it. AFAICT it's a mistake in Objects/unicodectype.c,
function _PyUnicode_ToTitlecase.

See
http://svn.python.org/view/python/trunk/Objects/unicodectype.c?rev=66362&view=markup

The code that says:
    if (ctype->title)
        delta = ctype->title;
    else
        delta = ctype->upper;
should IMHO merely be:
    delta = ctype->title;

A value of zero for ctype->title should be interpreted simply as the
offset to add to the ordinal, as it is in the sibling
_PyUnicode_To(Upper|Lower)case functions. See also
Tools/unicode/makeunicodedata.py, which treats upper, lower and title
identically when preparing the tables used by those 3 functions.

AFAICT making that change will fix the problem for those four
characters and not ruin any others.
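
To make the difference between the two variants concrete, here is a rough Python model of the delta logic quoted above. It is not the C implementation: TypeRecord, REC_01C5 and both function names are hypothetical stand-ins, and the deltas are derived from the mappings for U+01C5 (uppercase U+01C4, lowercase U+01C6, titlecase U+01C5).

```python
from collections import namedtuple

# Hypothetical stand-in for the C type record: each field is an offset
# (delta) that gets added to the code point.
TypeRecord = namedtuple("TypeRecord", "upper lower title")

# U+01C5: uppercase is U+01C4 (-1), lowercase is U+01C6 (+1),
# titlecase is the character itself (0).
REC_01C5 = TypeRecord(upper=-1, lower=+1, title=0)


def to_titlecase_with_fallback(cp, rec):
    """Model of the reported logic: a zero title delta falls back to upper."""
    delta = rec.title if rec.title else rec.upper
    return cp + delta


def to_titlecase_direct(cp, rec):
    """Model of the proposed logic: always apply the title delta."""
    return cp + rec.title


print(hex(to_titlecase_with_fallback(0x01C5, REC_01C5)))
print(hex(to_titlecase_direct(0x01C5, REC_01C5)))
```

With the fallback, a title delta of 0 is indistinguishable from "no titlecase recorded", so U+01C5 is mapped to its uppercase form; applying the delta directly leaves it unchanged.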

The error that you noticed occurs as far back as I've looked (2.1) and
also occurs in 3.0.

--
components: Unicode
messages: 80020
nosy: mrabarnett
severity: normal
status: open
title: Incorrect title case
type: behavior
versions: Python 2.4, Python 2.5, Python 2.6, Python 2.7, Python 3.0, Python 3.1




[issue4971] Incorrect title case

2009-01-17 Thread Martin v. Löwis

Changes by Martin v. Löwis mar...@v.loewis.de:


--
versions:  -Python 2.4, Python 2.5




[issue4971] Incorrect title case

2009-01-17 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

I do think this is a bug in the Unicode database. The current approach
(of falling back to uppercase if there is no title case in the Unicode
database) goes back to r17708. However, even the prior version only
contained explicitly the cases where a titlecase was specified and
different from the uppercase.

I think part of the motivation is this note from

http://www.unicode.org/Public/UNIDATA/UCD.html

Note: The simple titlecase may be omitted in the data file if the
titlecase is the same as the uppercase.

(notice that for uppercase, it says instead "The simple uppercase is
omitted in the data file if the uppercase is the same as the code point
itself"; likewise for lowercase)

Considering this note, the simple titlecase of U+01C5 *is* U+01C4: the
titlecase value is omitted, hence it is the same as uppercase, hence it
is U+01C4.

Most likely, the algorithm to produce the database was different from
the documented algorithm, and it is a bug in UCD.html. However, if
UCD.html is correct, it is likely a bug in UnicodeData.txt.

--
nosy: +loewis




[issue4971] Incorrect title case

2009-01-17 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Martin: "Considering this note, the simple titlecase of U+01C5 *is*
U+01C4: the titlecase value is omitted, hence it is the same as
uppercase, hence it is U+01C4."

Perhaps we are looking at different files; in the Unicode 5.1
UnicodeData.txt that I downloaded
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), the title field
for U+01C5 is *NOT* omitted, it is set to 01C5. AFAICT the intention is
that the four characters in question are their own titlecase, which is
not altogether unexpected given their visual representation.

Here's the record for U+01C5:
01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5

The note (which I hadn't noticed and explains the mention of
ctype->upper in the _PyUnicode_ToTitlecase function) says that the
titlecase value may be omitted if it is the same as the uppercase. FWIW
there are *no* examples in the current (5.1) file where the title field
is empty and the upper field is not empty.

ISTM the problem is that implementing the default-to-uppercase was not
done in Tools/unicode/makeunicodedata.py where full information is
available. This left no way in _PyUnicode_ToTitlecase of resolving the
ambiguity of a zero value for ctype->title -- is it "no titlecase
supplied so use uppercase" or is it "titlecase supplied, delta == 0,
means ch.title() -> ch"?
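
As a sketch of what resolving the default at table-build time could look like, the snippet below parses a UnicodeData.txt record and applies "titlecase omitted means same as uppercase" while the raw fields are still available. It is not taken from makeunicodedata.py: case_mappings() is a hypothetical helper, and the second record is an artificial example with the title field deliberately left empty (in the real file that field is filled in for U+0061).

```python
# Record for U+01C5 as found in UnicodeData.txt (semicolon-separated fields).
RECORD_01C5 = (
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;"
    "<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;"
    "01C4;01C6;01C5"
)

# Artificial record: title field left empty to exercise the default.
RECORD_NO_TITLE = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"


def case_mappings(record):
    """Return (code point, upper, lower, title) with defaults resolved."""
    fields = record.split(";")
    cp = int(fields[0], 16)
    # Fields 12, 13 and 14 hold the simple upper/lower/title mappings.
    upper_field, lower_field, title_field = fields[12], fields[13], fields[14]
    upper = int(upper_field, 16) if upper_field else cp
    lower = int(lower_field, 16) if lower_field else cp
    # An omitted titlecase defaults to the uppercase mapping; an explicit
    # one is kept even when it equals the code point itself.
    title = int(title_field, 16) if title_field else upper
    return cp, upper, lower, title


print([hex(v) for v in case_mappings(RECORD_01C5)])
print([hex(v) for v in case_mappings(RECORD_NO_TITLE)])
```

Because the empty-versus-present distinction is settled here, a run-time lookup built from these values would not need to guess what a zero delta means.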

--
nosy: +sjmachin




[issue4971] Incorrect title case

2009-01-17 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> Perhaps we are looking at different files;

Indeed, I was looking at the 3.2.0 database (assuming that it would be
the same in subsequent versions).

> ISTM the problem is that implementing the default-to-uppercase was not
> done in Tools/unicode/makeunicodedata.py where full information is
> available. This left no way in _PyUnicode_ToTitlecase of resolving the
> ambiguity of a zero value for ctype->title -- is it "no titlecase
> supplied so use uppercase" or is it "titlecase supplied, delta == 0,
> means ch.title() -> ch"?

Correct. So it seems this needs to be fixed in makeunicodedata.py
already. This was not the case with earlier versions of Unicode (which
never had a mapping to the same code point).

The logic for using deltas is also incorrect, so makeunicodedata.py
needs to be fixed anyway.
