[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
which fully implements tr11.  It includes Unicode::GCString, a class
that has a columns() method to determine the print columns.  This is very
fancy in the case of Asian widths, but of course there are many other cases too.

If you'd like, I can show you a program that uses these, a rewrite of the
standard Unix fmt(1) filter that works properly on Unicode column widths.

--tom




[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Martin v. Löwis mar...@v.loewis.de added the comment:

 Martin, I think you meant to write if w == 'A':.
 Some very common characters have ambiguous widths though (e.g. the Greek
 alphabet), so you can't just raise an error for them.

That's precisely why I don't think this should be in the library, but
in the application. Application developers who need that also need
to concern themselves with the border cases, and decide on how
they need to resolve them.

The column-width of a string is not an application issue.  It is
well-defined by Unicode.  Again, please see how we've done it in 
Perl, where tr11 is fully implemented.  The columns() method from 
Unicode::GCString always gives the right answer per the Standard for
any string, even what you are calling ambiguous ones.

This is not an applications issue -- at all.
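For illustration, here is a rough Python sketch of the kind of computation involved.  It is only an approximation: a faithful implementation needs full UAX #11 handling plus grapheme clusters (UAX #29), which is what Unicode::GCString's columns() provides; the treatment of ambiguous and zero-width code points here is a simplifying assumption.

    import unicodedata

    def print_width(s, ambiguous_width=1):
        # Rough sketch only: combining marks and a few zero-width code points
        # count as 0, East Asian Wide/Fullwidth as 2, everything else as 1.
        # Ambiguous width is left to the caller, since it is context dependent.
        width = 0
        for ch in s:
            if unicodedata.combining(ch) or ch in '\u200b\u200c\u200d\ufeff':
                continue
            eaw = unicodedata.east_asian_width(ch)
            if eaw in ('W', 'F'):
                width += 2
            elif eaw == 'A':
                width += ambiguous_width
            else:
                width += 1
        return width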

--tom




[issue12568] Add functions to get the width in columns of a character

2012-03-10 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Martin v. Löwis mar...@v.loewis.de added the comment:

 I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
 which fully implements tr11.

Thanks for the pointer!

 If you'd like, I can show you a program that uses these, a rewrite of the
 standard Unix fmt(1) filter that works properly on Unicode column widths.

I believe there can't be any truly proper implementation, as you
can't be certain how the terminal will handle these itself. 

Hm.  I think we may not be talking about the same thing after all.

If we're talking about the Curses library, or something similar,
this is not the same.  I do not think Curses has support for 
combining characters, right to left text, wide characters, etc.

However, Unicode does, and defines the column width for those.

I have an illustration of what this looks like in the picture
in the very last recipe, #44, in 

http://training.perl.com/scripts/perlunicook.html

That is what I have been calling print widths.  It's running
in a Mac terminal emulator, and unlike the HTML which grabs from too
many fonts, the terminal program does the right thing with the widths.

Are we talking about different things?

--tom




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-20 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Yes, it looks good.  Thank you very much.

-tom




[issue12568] Add functions to get the width in columns of a character

2011-10-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Martin v. Löwis mar...@v.loewis.de added the comment:

 I think the WideCharToMultibyte approach is just incorrect.

 I'm -1 on using wcswidth, though. 

Like you, I too seriously question using wcswidth() for this at all:

The wcswidth() function either shall return 0 (if pwcs points to a
null wide-character code), or return the number of column positions
to be occupied by the wide-character string pointed to by pwcs, or
return -1 (if any of the first n wide-character codes in the wide-
character string pointed to by pwcs is not a printable wide-
character code).

I would be willing to bet (a small amount of) money that it does not correctly
implement Unicode print widths, even though one would certainly *think* it
does according to this:

 The wcswidth() function determines the number of column positions
 required for the first n characters of pwcs, or until a null wide
 character (L'\0') is encountered.

There are a bunch of interesting cases I would want it tested against.
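One cheap way to run such tests is to call the C library's wcswidth() directly via ctypes and compare it with a Unicode-aware width function.  This is a sketch that assumes a platform whose libc exports wcswidth() (e.g. glibc); the particular test strings are just a starting point.

    import ctypes, ctypes.util

    libc = ctypes.CDLL(ctypes.util.find_library('c'))
    libc.wcswidth.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]

    # A few of the "interesting cases": combining marks, wide CJK, SHY, ZWNJ.
    for s in ['abc', 'e\u0301le\u0300ve', '\u4e2d\u6587', 'x\u00adx', 'x\u200cx']:
        print(repr(s), libc.wcswidth(s, len(s)))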

 We already have unicodedata.east_asian_width, which implements 
 http://unicode.org/reports/tr11/ 

 The outcomes of this function are these:
 - F: full-width, width 2, compatibility character for a narrow char
 - H: half-width, width 1, compatibility character for a narrow char
 - W: wide, width 2
 - Na: narrow, width 1
 - A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
 - N: neutral; not used in Asian text, so has no width. Practically, width can 
 be considered as 1

Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
And EA=N cannot be considered 1, either.

For example, some of the Marks are EA=A and some are EA=N, yet how many
print columns they take varies.  It is usually 0, but can be 1 at the start
of the file/string or immediately after a linebreak sequence.  Then there
are things like the variation selectors, which never take up a column at all.

Now consider the many \pC code points, like 

U+0009  CHARACTER TABULATION
U+00AD  SOFT HYPHEN 
U+200C  ZERO WIDTH NON-JOINER
U+FEFF  ZERO WIDTH NO-BREAK SPACE
U+2062  INVISIBLE TIMES

A TAB is its own problem but SHY we know is only width=1 immediately
before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
width=0.  So are the INVISIBLE * code points.

Context:

Imagine you're trying to format a string so that it takes up exactly 20
columns: you need to know how many spaces to pad it with based on the
print width.  That is what #12568 needs to do, and for that you need
much more than the East Asian Width properties.
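In other words, something along these lines (a toy sketch only; cell_width here is a made-up stand-in for a real columns()-style function, and it deliberately ignores the extra cases just listed):

    import unicodedata

    def cell_width(s):
        # Toy width function: combining marks are 0, Wide/Fullwidth are 2,
        # everything else is 1.
        return sum(0 if unicodedata.combining(ch)
                   else 2 if unicodedata.east_asian_width(ch) in ('W', 'F')
                   else 1
                   for ch in s)

    def pad_to(s, columns):
        # Pad with spaces to occupy exactly `columns` print columns;
        # len(s) is the wrong measure to use here.
        return s + ' ' * max(0, columns - cell_width(s))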

I really do think that what #12568 is asking for is to have the equivalent
of the Perl Unicode::GCString's columns() method, and that you aren't going
to be able to handle text alignment of Unicode with anything much less
than that.  After all, #12568's title is Add functions to get the width
in columns of a character.  I would very much like to compare what
columns() thinks with what wcswidth() thinks.  I bet wcswidth() is
very simple-minded at best.

I may of course be wrong.

--tom




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-09 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Sun, 09 Oct 2011 13:21:00 -: 

 Here is a new patch that stores the names of aliases and named
 sequences in the Private Use Area.

Looks good!  Thanks!

--tom




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-03 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Mon, 03 Oct 2011 04:15:51 -: 

 But it still has to happen at compile time, of course, so I don't know
 what you could do in Python.  Is there any way to change how the compiler
 behaves even vaguely along these lines?

 I think things like from __future__ import ... do something similar,
 but I'm not sure it will work in this case (also because you will have
 to provide the list of aliases somehow).

Ah yes, that's right.  Hm.  I bet then it *would* be possible, just perhaps
a bit of a run-around to get there.  Not a high priority, but interesting.

 less readable than:
 
 def my_capitalize(s):
     return s[0].upper() + s[1:].lower()

 You could argue that the first is much more explicit and in a way
 clearer, but overall I think you agree with me that is less readable.

Certainly.

It's a bit like the way bug rate per lines of code is invariant across
programming languages.  When you have more opcodes, it gets harder to
understand because there are more interactions and things to remember.

 That really isn't right.  A cased character is one with the Unicode Cased
 property, and a lowercase character is one with the Unicode Lowercase
 property.  The General Category is actually immaterial here.

 You might want to take a look and possibly add a comment on #12204 about this.

 I've spent all bloody day trying to model Python's islower, isupper, and 
 istitle
 functions, but I get all kinds of errors, both in the definitions and in the
 models of the definitions.

 If by model you mean trying to figure out how they work, it's
 probably easier to look at the implementation (I assume you know
 enough C to understand what they do).  You can find the code for
 str.istitle() at http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358
 and the actual implementation of some macros like Py_UNICODE_ISTITLE at
 http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

Thanks, that helps immensely.  I'm completely fluent in C.  I've gone 
and built a tags file of your whole v3.2 source tree to help me navigate.

The main underlying problem is that the internal macros are defined in a
way that made sense a long time ago, but no longer do ever since (for
example) the Unicode lowercase property stopped being synonymous with
GC=Ll and started also including all code points with the
Other_Lowercase property as well.

The originating culprit is Tools/unicode/makeunicodedata.py.
It builds your tables only using UnicodeData.txt, which is
not enough.  For example:

    if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
        flags |= ALPHA_MASK
    if category == "Ll":
        flags |= LOWER_MASK
    if "Line_Break" in properties or bidirectional == "B":
        flags |= LINEBREAK_MASK
        linebreaks.append(char)
    if category == "Zs" or bidirectional in ("WS", "B", "S"):
        flags |= SPACE_MASK
        spaces.append(char)
    if category == "Lt":
        flags |= TITLE_MASK
    if category == "Lu":
        flags |= UPPER_MASK

It needs to use DerivedCoreProperties.txt to figure out whether
something is Other_Uppercase, Other_Lowercase, etc. In particular:

Alphabetic := Lu+Ll+Lt+Lm+Lo + Nl + Other_Alphabetic
Lowercase  := Ll + Other_Lowercase
Uppercase  := Lu + Other_Uppercase

This affects a lot of things, but you should be able to just fix it
in Tools/unicode/makeunicodedata.py and have all of them start
working correctly.
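A sketch of the sort of parsing that would be needed (the file format is a code point or range, a semicolon, and the property name, with '#' comments; the function name here is just for illustration):

    import collections

    def parse_derived_core_properties(path='DerivedCoreProperties.txt'):
        # Build {property name: set of code points}, so the generator can test
        # Other_Lowercase, Other_Uppercase, Other_Alphabetic, and friends.
        props = collections.defaultdict(set)
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.split('#', 1)[0].strip()
                if not line:
                    continue
                codes, prop = (field.strip() for field in line.split(';'))
                first, _, last = codes.partition('..')
                props[prop].update(range(int(first, 16), int(last or first, 16) + 1))
        return props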

You will probably also want to add 

Py_UCS4 _PyUnicode_IsWord(Py_UCS4 ch)

that uses the UTS#18 Annex C definition, so that you catch marks, too.
That definition is:

Word := Alphabetic + Mc+Me+Mn + Nd + Pc

where Alphabetic is defined above to include Nl and Other_Alphabetic.
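A sketch of that predicate in Python terms, with other_alphabetic standing in for the Other_Alphabetic code points from DerivedCoreProperties.txt (unicodedata does not expose that property directly):

    import unicodedata

    def is_word_char(ch, other_alphabetic=frozenset()):
        # UTS#18 Annex C:  Word := Alphabetic + Mc+Me+Mn + Nd + Pc
        cat = unicodedata.category(ch)
        return (cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl')  # letters + letter numbers
                or ord(ch) in other_alphabetic                # Other_Alphabetic
                or cat in ('Mc', 'Me', 'Mn')                  # marks
                or cat in ('Nd', 'Pc'))                       # digits, connector punctuation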

Somewhat related is stuff like this:

typedef struct {
const Py_UCS4 upper;
const Py_UCS4 lower;
const Py_UCS4 title;
const unsigned char decimal;
const unsigned char digit;
const unsigned short flags;
} _PyUnicode_TypeRecord;

There are two different bugs here.  First, you are missing 

const Py_UCS4 fold;

which is another field from UnicodeData.txt, one that is critical 
for doing case-insensitive matches correctly.

Second, there's also the problem that Py_UCS4 is an int.  That means you
are stuck with just the character-based simple versions of upper-, title-,
lower-, and foldcase.  You need to have fields for the full mappings, which
are now strings (well, int arrays) not single ints.  I'll use ??? for the
int-array type that I don't know:

const ??? upper_full;
const ??? lower_full;
const ??? title_full;
const ??? fold_full;
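For a feel of why the full mappings need string-valued fields, here is how full case mapping behaves at the string level, shown with Python's own string methods in versions that apply SpecialCasing.txt and provide casefold (3.3 and later); the underlying point about the C struct is unchanged:

    print('ß'.upper())       # 'SS'  -- one code point maps to two
    print('ﬁ'.upper())       # 'FI'  -- the fi ligature expands
    print(len('ΐ'.upper()))   # 3     -- iota + diaeresis + tonos
    print('ß'.casefold())    # 'ss'  -- full case folding, for caseless matching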

You will also need to extend the API from just

Py_UCS4 _PyUnicode_ToUppercase(Py_UCS4 ch)

to something like

??? _PyUnicode_ToUppercase_Full(Py_UCS4 ch)

I don't know what the ??? return type is there, but it's whatever the
upper_full field

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-02 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Sun, 02 Oct 2011 06:46:26 -: 

 Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely
 because that's a Unicode 1 name, and nowadays these codepoints are simply
 marked as 'control'.

Yes, but there are a lot of them, 65 of them in fact.  I do not care to 
see people being forced to use literal control characters or inscrutable
magic numbers.  It really bothers me that you have all these defined code 
points with properties and all that have no name.   People do use these.
Some of them a lot.  I don't mind \n and such -- and in fact, prefer them 
even -- but I feel I should not have to scratch my head over character \033, \177,
and brethren.  The C0 and C1 standards are not just inventions, so we use 
them.  Far better that one should write \N{ESCAPE} for \033 or \N{DELETE} 
for \177, don't you think?  

 If so, then I don't understand that.  Nobody in their right=20
 mind prefers \N{LINE FEED (LF)} over \N{LINE FEED} -- do they?

 They probably don't, but they just write \n anyway.  I don't think we need
 to support any of these aliases, especially if they are not defined in the
 Unicode standard.

If you look at Names.txt, there are significant aliases there for 
the C0/C1 stuff.  My bottom line is that I don't like to be forced
to use magic numbers.  I prefer to name my abstractions.  It is more
readable and more maintainable that way.   

There are still holes of course.  Code point 128 has no name even in C1.
But something is better than nothing.  Plus at least in Perl we *can* give
things names if we want, per the APPLE LOGO example for U+F8FF.  So nothing
needs to remain nameless.  Why, you can even name your Kanji if you want, 
using whatever Romanization you prefer.  I think the private-use case
example is really motivating, but I have no idea how to do this for Python
because there is no lexical scope.  I suppose you could attach it to the
module, but that still doesn't really work because of how things get evaluated.
With a Perl compile-time use, we can change the compiler's ideas about
things, like adding function prototypes and even extending the base types:

% perl -Mbigrat -le 'print 1/2 + 2/3 * 4/5'
31/30

% perl -Mbignum -le 'print 21->is_odd'
1
% perl -Mbignum -le 'print 18->is_odd'
0

% perl -Mbignum -le 'print substr(2**5000, -3)'
376
% perl -Mbignum -le 'print substr(2**5000-1, -3)'
375

% perl -Mbignum -le 'print length(2**5000)'
1506
% perl -Mbignum -le 'print length(10**5000)'
5001

% perl -Mbignum -le 'print ref 10**5000'
Math::BigInt
% perl -Mbigrat -le 'print ref 1/3'
Math::BigRat

I recognize that redefining what sort of object the compiler treats some 
of its constants as is never going to happen in Python, but we actually
did manage that with charnames without having to subclass our strings:
the hook for \N{...} doesn't require object games like the ones above.

But it still has to happen at compile time, of course, so I don't know
what you could do in Python.  Is there any way to change how the compiler
behaves even vaguely along these lines?

The run-time lookups of Python's unicodedata.lookup (like Perl's
charnames::vianame) and unicodedata.name (like Perl's charnames::viacode
on the ord) could be managed with a hook, but the compile-time lookups
of \N{...} I don't see any way around.  But I don't know anything about
Python's internals, so don't even know what is or is not possible.

I do note that if you could extend \N{...} the way we do with charname
aliases for private-use characters, the user could load something that 
did the C0 and C1 control if they wanted to.  I just don't know how to 
do that early enough that the Python compiler would see it.  Your import
happens at run-time or at compile-time?  This would be some sort of
compile-time binding of constants.
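A sketch of what such lookups look like from Python once the aliases are loaded (this works in Python versions whose unicodedata and \N{} read NameAliases.txt, i.e. 3.3 and later; earlier versions raise KeyError):

    import unicodedata

    # Formal aliases from NameAliases.txt resolve in 3.3+:
    print(hex(ord(unicodedata.lookup('LATIN CAPITAL LETTER GHA'))))  # 0x1a2, corrected alias
    print('\N{BYTE ORDER MARK}' == '\ufeff')                         # True, alias for U+FEFF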

 Python doesn't require it. :)/2

 I actually find those *less* readable.  If there's something fancy in the
 regex, a comment *before* it is welcomed, but having to read a regex divided
 on several lines and remove meaningless whitespace and redundant comments
 just makes the parsing more difficult for me.

Really?  White space makes things harder to read?  I thought Pythonistas
believed the opposite of that.  Whitespace is very useful for cognitive
chunking: you see how things logically group together.

Inomorewantaregexwithoutwhitespacethananyothercodeortext. :)

I do grant you that chatty comments may be a separate matter.

White space in patterns is also good when you have successive patterns
across multiple lines that have parts that are the same and parts that
are different, as in most of these, which is from a function to render
an English headline/book/movie/etc title into its proper casing:

# put into lowercase if on our stop list, else titlecase
s/  ( \pL [\pL']* )  /$stoplist{$1} ? lc

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-02 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Really?  White space makes things harder to read?  I thought Pythonistas
 believed the opposite of that.

 I was surprised at that too ;-). One person's opinion in a specific 
 context. Don't generalize.

The example I initially showed probably wasn't the best for that.
Mostly I was trying to demonstrate how useful it is to have user-defined
properties is all.  But I have not asked for that (I have asked for properties,
though).

 English titling rules
 only capitalize the first word in hyphenated words, which is why it's
 Anti‐intellectual not Anti-Intellectual.

 Except that I can imagine someone using the latter as a noun to make the 
 work more officious or something. 

If Good-Looking looks more officious than Good-looking, I bet GOOD-LOOKING
is better still. :)

 There are no official English titling rules and as you noted,
 publishers vary. 

If there aren't any rules, then how come all book and movie titles always
look the same?  :)  I don't think anyone would argue with these two:

 1. Capitalize the first word, the last word, and the word right after a
colon (or semicolon).

 2. Capitalize all intervening words except for articles (a, an, the)
and short prepositions.

Those are the basic rules.  The main problem is that short isn't
well defined--and indeed, there are even places where preposition 
isn't well defined either.  

English has sentence casing (only the first word) and headline casing (most of 
them).
It's problematic that computer people call capitalizing each word titlecasing,
since in English, this is never correct.


http://www.chicagomanualofstyle.org/CMS_FAQ/CapitalizationTitles/CapitalizationTitles23.html

 Although Chicago style lowercases prepositions (but see CMOS 8.157
 for exceptions), some style guides uppercase them. Ask your editor
 for a style guide.

I myself usually fall back to the Chicago Manual of Style or the Oxford
Guide to Style.  I don't think I do anything that neither of them says to do.

But I completely agree that this should *not* be in the titlecase()
function.  I think the docs for the function might perhaps say something
about how it does not mean correct English headline case when it says
titlecase, but that's largely just nitpicking.

 I agree that str.title should do something sensible
 based on Unicode, with the improvements you mentioned.

One of the goals of Unicode is that casing not be language dependent.  And
they almost got there, too.  The Turkic I is the most notable exception.

Did you know there is a problem with all the case stuff in Python?  It 
was clearly put in before they had realized that they needed to have
things other than Lu/Lt/Ll have casing properties.  That's why there is
a difference between GC=Ll and the Lowercase property.

str.islower()

Return true if all cased characters in the string are lowercase and
there is at least one cased character, false otherwise. Cased
characters are those with general category property being one of
“Lu”, “Ll”, or “Lt” and lowercase characters are those with general
category property “Ll”.

http://docs.python.org/release/3.2/library/stdtypes.html

That really isn't right.  A cased character is one with the Unicode Cased
property, and a lowercase character is one with the Unicode Lowercase
property.  The General Category is actually immaterial here.
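A sketch of the definition being argued for, with cased and lowercase standing in for sets of code points built from the Cased and Lowercase derived properties in DerivedCoreProperties.txt (unicodedata does not expose those properties directly):

    def islower_unicode(s, cased, lowercase):
        # At least one Cased character, and every Cased character is Lowercase.
        cased_chars = [ch for ch in s if ord(ch) in cased]
        return bool(cased_chars) and all(ord(ch) in lowercase for ch in cased_chars)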

I've spent all bloody day trying to model Python's islower, isupper, and istitle
functions, but I get all kinds of errors, both in the definitions and in the
models of the definitions.  Under both 2.7 and 3.2, I get all these bugs:

ᶜ not islower() but has at least one cased character with all cased 
characters lowercase!
ᴰ not islower() but has at least one cased character with all cased 
characters lowercase!
ⓚ not islower() but has at least one cased character with all cased 
characters lowercase!
ͅ not islower() but has at least one cased character with all cased 
characters lowercase!
Ⅷ not isupper() but has at least one cased character with all cased 
characters uppercase!
Ⅷ not istitle() but should be
ⅷ not islower() but has at least one cased character with all cased 
characters lowercase!
2ⁿᵈ not islower() but has at least one cased character with all cased 
characters lowercase!
2ᴺᴰ not islower() but has at least one cased character with all cased 
characters lowercase!
Ὰͅ isupper() but fails to have at least one cased character with all cased 
characters uppercase!
ThisIsInTitleCaseYouKnow not istitle() but should be
Mᶜ isupper() but fails to have at least one cased character with all cased 
characters uppercase!
ᶜM isupper() but fails to have at least one cased character with all cased 
characters uppercase!
ᶜM istitle() but should not be
MᶜKINLEY isupper() but fails to have at least one cased character with all 
cased characters uppercase!

I really don't understand.  BTW

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-10-01 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Martin v. Löwis rep...@bugs.python.org wrote
   on Sat, 01 Oct 2011 10:59:48 -: 

  * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

 Where did you get that definition from? UTS#18 defines
 word_character, which is Alphabetic + U+200C + U+200D
 (i.e. not including marks, but including those

From UTS#18 RL1.2A in Annex C, where a \p{word} or \w character 
is defined to be 

 \p{alpha}
 \p{gc=Mark}
 \p{digit}
 \p{gc=Connector_Punctuation}

 I think you are looking for here are Word characters without 
 Nd + Pc, so just Alphabetic + Mn+Mc+Me.  
 
 Is that right?
 
 With your definition of Word character above, yes, that's right.

It's not mine.  It's tr18's.

 Marks won't start a word, though.

That's the smarter boundary thing they talk about.  

I'm not myself familiar with \pM

 As for terminology: I think the documentation should continue to
 speak about words and letters, and then define what is meant
 in this context. It's not that the Unicode consortium invented
 the term letter, so we should use it more liberally than just
 referring to the L* categories.

I really don't think it wise to have private definitions of these.

If Letter doesn't mean L?, things get too weird.  That's why 
there are separate definitions of alphabetic, word, etc.

--tom




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-10-01 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Perl does not provide the old 1.0 names at all.  We don't have a Unicode
 1.0 legacy to support, which makes this cleaner.  However, we do provide
 for the names of the C0 and C1 Control Codes, because apart from Unicode
 1.0, they don't condescend to name the ASCII or Latin1 control codes.

 If there would be a reasonably official source for these names, and one
 that guarantees that there is no collision with UCD names, I could
 accept doing so for Python as well.

The C0 and C1 control code names don't change.  There is/was one stability
issue where they screwed up, because they ended up having a UAX (required)
and a UTS (not required) fighting because of the dumb stuff they did with
the Emoji names. They neglected to prefix them with Emoji ... or some
such, the way things like GREEK ... LETTER ... or MATHEMATICAL ... or
MUSICAL ... did.  The problem is they stole BELL without calling it EMOJI
BELL.  That is the C0 name for Control-G.  Dimwits.

The problem with official names is that they have things in them that you
do not expect in names.  Do you really and truly mean to tell me you
think it is somehow **good** that people are forced to write

\N{LINE FEED (LF)}

Rather than the more obvious pair of 

\N{LINE FEED}
\N{LF}

??

If so, then I don't understand that.  Nobody in their right 
mind prefers \N{LINE FEED (LF)} over \N{LINE FEED} -- do they?

% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
U+000A
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
U+000A
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
U+000A

% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
U+0085
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
U+0085
% perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
U+0085

 We also provide for certain well known aliases from the Names file:
 anything that says * commonly abbreviated as ..., so things like LRO
 and ZWJ and such.

 -1. Readability counts, writability not so much (I know this is
 different for Perl :-). 

I actually very strongly resent and rebuff that entire mindset in the most
extreme way possible.  Well-written Perl code is perfectly readable by
people who speak that language.  If you find Perl code that isn't readable,
it is by definition not well-written.

*PLEASE* don't start.  

Yes, I just got done driving 16 hours and am overtired, but it's 
something I've been fighting against all of my professional career.
It's a leyenda negra.

 If there is too much aliasing, people will
 wonder what these codes actually mean.

There are 15 commonly abbreviated as aliases in the Names.txt file.

* commonly abbreviated as NBSP
* commonly abbreviated as SHY
* commonly abbreviated as CGJ
* commonly abbreviated ZWSP
* commonly abbreviated ZWNJ
* commonly abbreviated ZWJ
* commonly abbreviated LRM
* commonly abbreviated RLM
* commonly abbreviated LRE
* commonly abbreviated RLE
* commonly abbreviated PDF
* commonly abbreviated LRO
* commonly abbreviated RLO
* commonly abbreviated NNBSP
* commonly abbreviated WJ

All of the standards documents *talk* about things like LRO and ZWNJ.
I guess the standards aren't readable then, right? :)

From the charnames manpage, which shows that we really don't just make
these up as we feel like (although we could; see below).  They're all from
this or that standard:

ALIASES
   A few aliases have been defined for convenience: instead
   of having to use the official names

   LINE FEED (LF)
   FORM FEED (FF)
   CARRIAGE RETURN (CR)
   NEXT LINE (NEL)

   (yes, with parentheses), one can use

   LINE FEED
   FORM FEED
   CARRIAGE RETURN
   NEXT LINE
   LF
   FF
   CR
   NEL

   All the other standard abbreviations for the controls,
   such as ACK for ACKNOWLEDGE also can be used.

   One can also use

   BYTE ORDER MARK
   BOM

   and these abbreviations

   Abbreviation   Full Name

   CGJ            COMBINING GRAPHEME JOINER
   FVS1           MONGOLIAN FREE VARIATION SELECTOR ONE
   FVS2           MONGOLIAN FREE VARIATION SELECTOR TWO
   FVS3           MONGOLIAN FREE VARIATION SELECTOR THREE
   LRE            LEFT-TO-RIGHT EMBEDDING
   LRM            LEFT-TO-RIGHT MARK
   LRO            LEFT-TO-RIGHT OVERRIDE
   MMSP           MEDIUM MATHEMATICAL SPACE
   MVS            MONGOLIAN VOWEL SEPARATOR
   NBSP           NO-BREAK SPACE
   NNBSP          NARROW NO-BREAK SPACE
   PDF            POP DIRECTIONAL FORMATTING
   RLE            RIGHT-TO-LEFT EMBEDDING

[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-09-30 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Martin v. Löwis mar...@v.loewis.de added the comment:

 Split S into words. Change the first letter in a word to upper-case,

Except that I think you actually mean that the first letter is 
changed into titlecase not uppercase.  

One might also say *try* to change for all these, in that not
all cased code points in Unicode have casemaps that are different
from themselves.  For example, a superscript lowercase a or b has
no distinct uppercase mapping, the way the non-superscript versions do:

% (echo xyz; echo ab AB | unisupers) | uc
XYZ
ᵃᵇ ᴬᴮ

 and all subsequent letters to lower case. A word is a sequence that
 starts with a letter, followed by letter-related characters.

I don't like the way you have defined letters and letter-related
characters.  The first already has a definition, which is not the
one you are using.  Word characters also has a definition in Unicode,
and it is not the one you are using.  I strongly advise against
redefining standard Unicode properties.  Choose other, unused terms 
if you must.  It is very confusing otherwise.

 Letters are all characters from the Alphabetic category, i.e.
 Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic.

Except that is exactly the definition of the Unicode Alphabetic property,
not the Unicode Letter property.  It is a mistake to equate
Letter=Alphabetic, and very confusing too.

I agree that this probably what you want, though.  I just don't think you
should use letter-related characters when there is an existing formal
definition that works, or that you should redefine Letter.

 letter-related characters are letters + marks (Mn, Mc, Me).

That isn't quite right.  

 * Letters are Lu+Ll+Lt+Lm+Lo.

 * Alphabetic is Letters + Other_Alphabetic.

 * Other_Alphabetic is certain marks (like the iota subscript) and the
   letter numbers (Nl), as well as a few symbols.

 * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.

I think you are looking for here are Word characters without 
Nd + Pc, so just Alphabetic + Mn+Mc+Me.  

Is that right?
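To make the intent concrete, here is a minimal Python sketch of titlecasing one word while leaving marks alone; it is only an illustration of the rule discussed above, not of any existing implementation:

    import unicodedata

    def title_word(word):
        out = []
        for i, ch in enumerate(word):
            if unicodedata.category(ch).startswith('M'):
                out.append(ch)            # marks are never case-mapped
            elif i == 0:
                out.append(ch.title())    # first character gets titlecase
            else:
                out.append(ch.lower())    # the rest get lowercase
        return ''.join(out)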

--tom

PS: You can do union/intersection stuff with properties to see what
the resulting sets look like using the unichars command-line tool.

This is everything that is both alphabetic and also a mark:

% unichars -gs '\p{Alphabetic}' '\pM'
‭ ○ͅ  U+0345 GC=Mn SC=InheritedCOMBINING GREEK YPOGEGRAMMENI
‭ ○ְ  U+05B0 GC=Mn SC=Hebrew   HEBREW POINT SHEVA
‭ ○ֱ  U+05B1 GC=Mn SC=Hebrew   HEBREW POINT HATAF SEGOL
‭ ○ֲ  U+05B2 GC=Mn SC=Hebrew   HEBREW POINT HATAF PATAH
‭ ○ֳ  U+05B3 GC=Mn SC=Hebrew   HEBREW POINT HATAF QAMATS
...
‭ ○ं  U+0902 GC=Mn SC=Devanagari   DEVANAGARI SIGN ANUSVARA
‭ ः  U+0903 GC=Mc SC=Devanagari   DEVANAGARI SIGN VISARGA
‭ ा  U+093E GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN AA
‭ ि  U+093F GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN I
‭ ी  U+0940 GC=Mc SC=Devanagari   DEVANAGARI VOWEL SIGN II
‭ ○ु  U+0941 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN U
‭ ○ू  U+0942 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN UU
‭ ○ृ  U+0943 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN VOCALIC R
‭ ○ॄ  U+0944 GC=Mn SC=Devanagari   DEVANAGARI VOWEL SIGN VOCALIC RR
...

While these are the NON-alphabetic marks, which are still Word
characters though of course:

% unichars -gs '\P{Alphabetic}' '\pM'
‭ ○̀  U+0300 GC=Mn SC=InheritedCOMBINING GRAVE ACCENT
‭ ○́  U+0301 GC=Mn SC=InheritedCOMBINING ACUTE ACCENT
‭ ○̂  U+0302 GC=Mn SC=InheritedCOMBINING CIRCUMFLEX ACCENT
‭ ○̃  U+0303 GC=Mn SC=InheritedCOMBINING TILDE
‭ ○̄  U+0304 GC=Mn SC=InheritedCOMBINING MACRON
‭ ○̅  U+0305 GC=Mn SC=InheritedCOMBINING OVERLINE
‭ ○̆  U+0306 GC=Mn SC=InheritedCOMBINING BREVE
‭ ○̇  U+0307 GC=Mn SC=InheritedCOMBINING DOT ABOVE
‭ ○̈  U+0308 GC=Mn SC=InheritedCOMBINING DIAERESIS
‭ ○̉  U+0309 GC=Mn SC=InheritedCOMBINING HOOK ABOVE
‭ ○̊  U+030A GC=Mn SC=InheritedCOMBINING RING ABOVE
‭ ○̋  U+030B GC=Mn SC=InheritedCOMBINING DOUBLE ACUTE ACCENT
‭ ○̌  U+030C GC=Mn SC=InheritedCOMBINING CARON
...

And here are the Cased code points that do not change when 
upper-, title-, or lowercased:

% unichars -gs '\p{Cased}' '[^\p{CWU}\p{CWT}\p{CWL}]'
‭ ª  U+00AA GC=Ll SC=LatinFEMININE ORDINAL INDICATOR
‭ º  U+00BA GC=Ll SC=LatinMASCULINE ORDINAL INDICATOR
‭ ĸ  U+0138 GC=Ll SC=LatinLATIN SMALL LETTER KRA
‭ ƍ  U+018D GC=Ll SC=LatinLATIN SMALL LETTER TURNED DELTA
‭ ƛ  U+019B GC=Ll SC=LatinLATIN SMALL LETTER LAMBDA WITH STROKE
‭ ƪ  U+01AA GC=Ll SC=LatinLATIN LETTER REVERSED ESH LOOP
‭ ƫ  U+01AB GC=Ll SC=LatinLATIN SMALL LETTER T WITH PALATAL HOOK
‭ ƺ  U+01BA GC=Ll SC=LatinLATIN SMALL LETTER EZH WITH TAIL
‭ ƾ  U+01BE GC=Ll SC=LatinLATIN LETTER INVERTED GLOTTAL STOP WITH 
STROKE
‭ ȡ

[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-09-30 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

 Leaving named sequences for unicodedata.lookup() only (and not for
 \N{}) makes sense.

There are certainly advantages to that strategy: you don't have to
deal with [\N{sequence}] issues.  If the argument to unicodedata.lookup()
can be any of name, alias, or sequence, that seems ok.  \N{} should
still do aliases, though, since those don't have the complication that
sequences have.

You may wish unicodedata.name() to return the alias in preference, however.
That's what we do.  And of course, there is no issue of sequences there.

The rest of this perhaps painfully long message is just elaboration
and icing on what I've said above.

--tom

 The list of aliases is so small (11 entries) that I'm not sure using a
 binary search for it would bring any advantage.  Having a single
 lookup algorithm that looks in both tables doesn't work because the
 aliases lookup must be in _getcode for \N{...} to work, whereas the
 lookup of named sequences will happen in unicodedata_lookup
 (Modules/unicodedata.c:1187).  I think we can leave the for loop over
 aliases in _getcode and implement a separate (and binary) search in
 unicodedata_lookup for the named sequences.  Does that sound fine?

If you mean, is it ok to add just the aliases and not the named sequences to
\N{}, it is certainly better than not doing so at all.  Plus that way you do
*not* have to figure out what in the world to do with [^a-c\N{sequence}],
since that would have to be something like (?!\N{sequence})[^a-c], which is 
hardly obvious, especially if \N{sequence} actually starts with [a-c].

However, because the one namespace comprises all three of names,
aliases, and named sequences, it might be best to have a functional
(meaning, non-regex) API that allows one to do a fetch on the whole
namespace, or on each individual component.

The ICU library supports this sort of thing.  In ICU4J's Java bindings, 
we find this:

static int getCharFromExtendedName(String name) 
   [icu] Find a Unicode character by either its name and return its code 
point value.
static int  getCharFromName(String name) 
   [icu] Finds a Unicode code point by its most current Unicode name and 
return its code point value.
static int  getCharFromName1_0(String name) 
   [icu] Find a Unicode character by its version 1.0 Unicode name and 
return its code point value.
static int  getCharFromNameAlias(String name) 
   [icu] Find a Unicode character by its corrected name alias and return 
its code point value.

The first one obviously has a bug in its definition, as the English
doesn't scan.  Looking at the full definition is even worse.  Rather
than dig out the src jar, I looked at ICU4C, but its own bindings are
completely different.  There you have only one function, with an enum to
say what namespace to access:

UChar32 u_charFromName  (   UCharNameChoice nameChoice, 
const char *name, 
UErrorCode *pErrorCode 
)

The UCharNameChoice enum tells what sort of thing you want:

U_UNICODE_CHAR_NAME,
U_UNICODE_10_CHAR_NAME,
U_EXTENDED_CHAR_NAME,
U_CHAR_NAME_ALIAS,  
U_CHAR_NAME_CHOICE_COUNT

Looking at the src for the Java is no more immediately illuminating, 
but I think that extended may refer to a union of the old 1.0 names 
with the current names.

Now I'll tell you what Perl does.  I do this not to say it is right,
but just to show you one possible strategy.  I also am in the middle
of writing about this for the Camel, so it is in my head.

Perl does not provide the old 1.0 names at all.  We don't have a Unicode
1.0 legacy to support, which makes this cleaner.  However, we do provide
for the names of the C0 and C1 Control Codes, because apart from Unicode
1.0, they don't condescend to name the ASCII or Latin1 control codes.  

We also provide for certain well known aliases from the Names file:
anything that says * commonly abbreviated as ..., so things like LRO
and ZWJ and such.

Perl makes no distinction between anything in the namespace when using
the \N{} form for string and regex escapes.  That means when you use
\N{...} or /\N{...}/, you don't know which it is, nor can you.
(And yes, the bracketed character class issue is annoying and unsolved.)

However, the functional API does make a slight distinction.  

 -- charnames::vianame() takes a name or alias (as a string) and returns
a single integer code point.

eg: This therefore converts LATIN SMALL LETTER A into 0x61.
It also converts both 
BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
and 
BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
into 0x1D0C5.  See below.

 -- charnames::string_vianame() takes a string name, alias, *or* sequence, 
and gives back a string.   

eg

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Mon, 19 Sep 2011 11:11:48 -: 

 We could also look at what other languages do and/or ask to the
 Unicode consortium.

I will look at what Java does a bit later on this morning, which is the
only other commonly used language besides C that I feel even reasonably
competent at.  I seem to recall that Java changed its default behavior on
certain Unicode decoding issues from warnings to exceptions between one
release and the next, but can't remember any details.

As the Perl Foundation is a member of the Unicode Consortium and I am on
the mailing list, I suppose I could just ask them.  I feel a bit timid
though because the last thing I brought up there was based on a subtle
misunderstanding of mine regarding the IDC and Pattern_Syntax properties.
I hate looking dumb twice in a row. :)

--tom




[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

No good news on the Java front.  They do all kinds of things wrong.  
For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8 
input stream, which is illegal.  There's more they do wrong, including 
in their documentation, but I won't bore you with their errors.

I'm going to seek clarification on some matters here.

--tom




[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-19 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

It appears that I'm right about surrogates, but wrong about
noncharacters.  I'm seeking a clarification there.

--tom




[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-18 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy rep...@bugs.python.org wrote
   on Thu, 08 Sep 2011 18:56:11 -: 

On 9/8/2011 4:32 AM, Ezio Melotti wrote:

 So to summarize a bit, there are different possible level of strictness:
1) all the possible encodable values, including the ones above 10FFFF;
2) values in range 0..10FFFF;
3) values in range 0..10FFFF except surrogates (aka scalar values);
4) values in range 0..10FFFF except surrogates and noncharacters;

 and this is what is currently available in Python:
1) not available, probably it will never be;
2) available through the 'surrogatepass' error handler;
3) default behavior (i.e. with the 'strict' error handler);
4) currently not available.

 Now, assume that we don't care about option 1 and want to implement the 
 missing option 4 (which I'm still not 100% sure about).  The possible 
 options are:
* add a new codec (actually one for each UTF encoding);
* add a new error handler that explicitly disallows noncharacters;
* change the meaning of 'strict' to match option 4;

 If 'strict' meant option 4, then 'scalarpass' could mean option 3. 
 'surrogatepass' would then mean 'pass surragates also, in addition to 
 non-char scalers'.

I'm pretty sure that anything that claims to be UTF-{8,16,32} needs  
to reject both surrogates *and* noncharacters. Here's something from the
published Unicode Standard's p.24 about noncharacter code points:

• Noncharacter code points are reserved for internal use, such as for 
  sentinel values. They should never be interchanged. They do, however,
  have well-formed representations in Unicode encoding forms and survive
  conversions between encoding forms. This allows sentinel values to be
  preserved internally across Unicode encoding forms, even though they are
  not designed to be used in open interchange.

And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:

C2 A process shall not interpret a noncharacter code point as an 
   abstract character.

• The noncharacter code points may be used internally, such as for 
  sentinel values or delimiters, but should not be exchanged publicly.

I'd have to check the fine print, but I am pretty sure that "shall not" 
is an imperative form.  We have to understand that to read that a conforming
process *must not* do that.  It's because of that wording that in Perl,
using either of {en,de}code() with any of the UTF-{8,16,32} encodings,
including the LE/BE versions as appropriate, it will not produce nor accept
a noncharacter code point like FDD0 or FFFE.

Do you think we may perhaps have misread that conformance clause?
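For comparison, this is how CPython's strict UTF-8 codec behaves: surrogates are rejected unless you explicitly ask for them, but the noncharacters sail straight through:

    >>> '\ud800'.encode('utf-8')                   # surrogate, 'strict' handler
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' ...
    >>> '\ud800'.encode('utf-8', 'surrogatepass')  # only if explicitly requested
    b'\xed\xa0\x80'
    >>> '\ufdd0'.encode('utf-8')                   # noncharacter: no complaint
    b'\xef\xb7\x90'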

Using Perl's special, loose-fitting utf8 encoding, you can get it to do
noncharacter code points and even surrogates, but you have to suppress
certain things to make that happen quietly.  You can only do this with
utf8, not any of the UTF-16 or UTF-32 flavors.  There we give them no 
choice, so you must be strict.  I agree this is not fully orthogonal.

Note that this is the normal thing that people do:

binmode(STDOUT, ":utf8");

which is the *loose* version.  The strict one is "utf8-strict" or "UTF-8":

open(my $fh, "<:encoding(UTF-8)", $pathname)

So it is a bit too easy to get the loose one.  We felt we had to do this
because we were already using the loose definition (and allowing up to
chr(2**32) etc) when the Unicode Consortium made clear what sorts of
things must not be accepted, or perhaps, before we made ourselves clear
on this.  This will have been back in 2003, when I wasn't paying very
close attention.

I think that just like Perl, Python has a legacy of the original loose
definition.  So some way to accommodate that legacy while still allowing
for a conformant application should be devised.  My concern with Python
is that people tend to make their own manual calls to encode/decode a lot
more often than they do in Perl.  That means that if you only catch it
on a stream encoding, you'll miss it, because they will use binary I/O
and miss the check.

--tom

Below I show a bit of how this works in Perl.  Currently the builtin
utf8 encoding is controlled somewhat differently from how the Encode
module's encode/decode functions are.  Yes, this is not my idea of good.

This shows that noncharacters and surrogates do not survive the
encoding/decoding process for UTF-16:

% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFDD0)))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFFFE)))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xD800)))' | uniquote -v
UTF-16 surrogate U+D800 in subroutine entry at
/usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.

If you pass a third argument to encode/decode, you can

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-09-07 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Sat, 03 Sep 2011 00:28:03 -: 

 Ezio Melotti ezio.melo...@gmail.com added the comment:

 Or they are still called UTF-8 but used in combination with different error
 handlers, like surrogateescape and surrogatepass.  The plain UTF-* codecs
 should produce data that can be used for open interchange, rejecting all the
 invalid data, both during encoding and decoding.

 Chapter 03, D79 also says:

To ensure that the mapping for a Unicode encoding form is one-to-one,
all Unicode scalar values, including those corresponding to
noncharacter code points and unassigned code points, must be mapped to
unique code unit sequences. Note that this requirement does not extend
to high-surrogate and low-surrogate code points, which are excluded by
definition from the set of Unicode scalar values.

 and this seems to imply that the only unencodable codepoints are the non-scalar
 values, i.e. surrogates and codepoints above U+10FFFF.  Noncharacters shouldn't
 thus receive any special treatment (at least during encoding).

 Tom, do you agree with this?  What does Perl do with them?

I agree that one needs to be able to encode any scalar value and
store it in memory in a designated character encoding form.

This is different from streams, though.

The 3 different Unicode character encoding *forms* -- UTF-8,
UTF-16, and UTF-32 -- certainly need to support all possible
scalar values.  These are the forms used to store code points in
memory.  They do not have BOMs, because one knows one's memory
layout.   These are specifically allowed to contain the
noncharacters:

http://www.unicode.org/reports/tr17/#CharacterEncodingForm

The third type is peculiar to the Unicode Standard: the noncharacter.
This is a kind of internal-use user-defined character, not intended for
public interchange.

The problem is that one must make a clean distinction between character
encoding *forms* and character encoding *schemes*.

http://www.unicode.org/reports/tr17/#CharacterEncodingScheme

It is important not to confuse a Character Encoding Form (CEF) and a CES.

1. The CEF maps code points to code units, while the CES transforms
   sequences of code units to byte sequences.
2. The CES must take into account the byte-order serialization of
   all code units wider than a byte that are used in the CEF.
3. Otherwise identical CESs may differ in other aspects, such as the
   number of user-defined characters allowed.

Some of the Unicode encoding schemes have the same labels as the three
Unicode encoding forms. [...]

As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, for
example the serialized bytes for streaming data or in files; they may have
either byte orientation, and a single BOM may be present at the start of the
data. When the usage of the abbreviated designators UTF-16 or UTF-32 might
be misinterpreted, and where a distinction between their use as referring to
Unicode encoding forms or to Unicode encoding schemes is important, the full
terms should be used. For example, use UTF-16 encoding form or UTF-16
encoding scheme. They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
respectively.

The Unicode Standard has seven character encoding schemes: UTF-8, UTF-16,
UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

* UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF32-LE are simple CESs.

* UTF-16 and UTF-32 are compound CESs, consisting of an single, optional
  byte order mark at the start of the data followed by a simple CES.

I believe that what this comes down to is that you can have noncharacters in 
memory
as a CEF, but that you cannot have them in a CES meant for open interchange.
And what you do privately is a different, third matter.
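Python's codecs make the same CES distinction visible: the compound scheme writes a BOM, the simple ones do not (output shown for a little-endian machine):

    >>> 'abc'.encode('utf-16')      # compound encoding scheme: BOM + code units
    b'\xff\xfea\x00b\x00c\x00'
    >>> 'abc'.encode('utf-16-le')   # simple encoding scheme: no BOM
    b'a\x00b\x00c\x00'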

What Perl does differs somewhat depending on whether you are just playing
around with encodings in memory versus using streams that have particular
encodings associated with them.  I believe that you can think of this as the
first being for CEF stuff and the second for CES stuff.

Streams are strict.  Memory isn't.

Perl will never ever produce nor accept one of the 66 noncharacters on any
stream marked as one of the 7 character encoding schemes.  However, we
aren't always good about whether we generate an exception or whether we
return replacement characters.  

Here the first process created a (for the nonce, nonfatal) warning, 
whereas the second process raised an exception:

 % perl -wle 'binmode(STDOUT, ":encoding(UTF-16)") || die; print chr(0xFDD0)' |
   perl -wle 'binmode(STDIN, ":encoding(UTF-16)") || die; print ord <STDIN>'
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
UTF-16:Unicode character fdd0 is illegal at -e line 1.
Exit 255

Here the first again makes a warning

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-29 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote
   on Mon, 29 Aug 2011 13:21:06 -: 

 It's not only typographically speaking, it's really a spelling error,
 even in hand-written text :-)

Sure, and so too is omitting an accent mark or diaeresis.  But—alas!—you’ll
never convince most monoglot anglophones of that, the ones who keep wanting to
strip them from résumé, façade, châteaux, crème brûlée, fête, tête-à-tête, 
à la française, or naïveté, not to mention José, jalapeño, the erstwhile
American Secretary of State Federico Peña, or nearby Cañon City, Colorado, 
where I have family.  I think œnology has survived solely on its rarity, 
and the Encyclopædia Britannica is that way because the ligat(ur)ed letter
is in their actual trademark.

Cell phone users sending text messages have long suffered the grievous
injuries to their language(s) that naked ASCII imparts, but this is
nothing like the crossdressing nightmare called Greeklish, also variously
known as Grenglish, Latinoellinika/Λατινοελληνικά, or ASCII Greek.

http://en.wikipedia.org/wiki/Greeklish

[...] The reason for this is the fact that text written in Greeklish
is considerably less aesthetically pleasing, and also much harder to
read, compared to text written in the Greek alphabet. A non-Greek
speaker/reader can guess this by this example: δις ιζ χαρντ του
ριντ would be the way to write this is hard to read in English
but utilizing the Greek alphabet.

I especially enjoy  George Baloglou’s Byzantine Grenglish, wherein:

Ὀδυσσεύς    = Oducceus     instead of Odysseus
Ἀχιλλεύς    = Axilleus     instead of Achilleus
Σίσυφος     = Sicuphos     instead of Sisyphus
Περικλῆς    = 5epiklhs     instead of Pericles
Χθονός      = X8onos       instead of Chthonos
Οι Ατρείδες = Oi Atpeides  instead of the Atreïdes

Terrible though the depredations upon the French language committed by
ASCII may be, surely these go even further. :)

--tom

Η Ιλιάδα                                      H Iliada

Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλῆος           Mhnin aeide, 8ea, 5hlhiadeo Axilhos
οὐλομένην, ἣ μυρί’ Ἀχαιοῖς ἄλγε’ ἔθηκε,       oulomenhn, 'h mupi’ Axaiois alge’ e8hke,
πολλὰς δ’ ἰφθίμους ψυχὰς Ἄϊδι προῒαψεν        nollas d’ iph8imous yuxas Aidi npoiayen
ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν        'hpoon, autous de elopia teuxe kuneccin
οἰωνοῖσί τε πᾶσι· Διὸς δ’ ἐτελείετο βουλή·    oionoici te naci· Dios d’ eteleieto boulh·
ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε          eks o'u dh ta npota diacththn epicante
Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.    Atpeidhs te anaks andpon kai dios Axilleus.




[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-28 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote on Sat, 27 Aug 2011 20:04:56 
-: 

 Neither am I.  Even in old-style English with ae and oe, one wrote
 ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
 *Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

 Trying to disprove you a bit:
 http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
 http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
 http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg

 but classical typographies seem to write either the uppercase Œ or the
 lowercase œ.

That's what I meant: one only ever sees œufs or ŒUFS, never OEUFS.
French doesn't fit into ISO 8859-1.  That's one of the changes to
ISO-8859-15 compared with ISO-8859-1 (and Unicode):

iso-8859-1   A4  ⇔  U+00A4  < ¤ >  \N{CURRENCY SIGN}
iso-8859-15  A4  ⇒  U+20AC  < € >  \N{EURO SIGN}

iso-8859-1   A6  ⇔  U+00A6  < ¦ >  \N{BROKEN BAR}
iso-8859-15  A6  ⇒  U+0160  < Š >  \N{LATIN CAPITAL LETTER S WITH CARON}

iso-8859-1   A8  ⇔  U+00A8  < ¨ >  \N{DIAERESIS}
iso-8859-15  A8  ⇒  U+0161  < š >  \N{LATIN SMALL LETTER S WITH CARON}

iso-8859-1   B4  ⇔  U+00B4  < ´ >  \N{ACUTE ACCENT}
iso-8859-15  B4  ⇒  U+017D  < Ž >  \N{LATIN CAPITAL LETTER Z WITH CARON}

iso-8859-1   B8  ⇔  U+00B8  < ¸ >  \N{CEDILLA}
iso-8859-15  B8  ⇒  U+017E  < ž >  \N{LATIN SMALL LETTER Z WITH CARON}

iso-8859-1   BC  ⇔  U+00BC  < ¼ >  \N{VULGAR FRACTION ONE QUARTER}
iso-8859-15  BC  ⇒  U+0152  < Œ >  \N{LATIN CAPITAL LIGATURE OE}

iso-8859-1   BD  ⇔  U+00BD  < ½ >  \N{VULGAR FRACTION ONE HALF}
iso-8859-15  BD  ⇒  U+0153  < œ >  \N{LATIN SMALL LIGATURE OE}

iso-8859-1   BE  ⇔  U+00BE  < ¾ >  \N{VULGAR FRACTION THREE QUARTERS}
iso-8859-15  BE  ⇒  U+0178  < Ÿ >  \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}
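The practical effect, seen through Python's codecs:

    >>> 'œufs'.encode('iso8859-15')   # Latin-9 has the ligature at 0xBD
    b'\xbdufs'
    >>> 'œufs'.encode('latin-1')      # Latin-1 has no œ at all
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u0153' ...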

 That said, I wonder why Unicode even includes ligatures like ff. Sounds
 like mission creep to me (and horrible annoyances for people like us).

I'm pretty sure that typographic ligatures are there for roundtripping
with legacy encodings.  I believe that œ/Œ is the only code point
with ligature in its name that you're supposed to still use, and
that all others should be figured out by modern fonting software.
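
Here's a quick way to see the difference from Python (a small sketch of
mine, using nothing beyond the stdlib unicodedata module):

    # The typographic ligature U+FB00 is a compatibility character and
    # decomposes under NFKC, while œ/Œ has no compatibility decomposition
    # and survives normalization: it is a real letter pair, not a glyph.
    import unicodedata

    for s in ('\uFB00', 'œ', 'Œ'):
        print('%r -> NFKC %r' % (s, unicodedata.normalize('NFKC', s)))
    # '\ufb00' -> NFKC 'ff'
    # 'œ' -> NFKC 'œ'
    # 'Œ' -> NFKC 'Œ'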

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12736
___



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-27 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Sat, 27 Aug 2011 03:26:21 -: 

 To me, making (default) iteration deviate from indexing is anathema.

So long as there's a way to iterate through a string some other way
than by code unit, that's fine.  However, the Java way of 16-bit code
units is so annoying because there often aren't code point APIs, and 
so you get a lot of niggling errors creeping in.  This is part of why
I strongly prefer wide builds, so that code point and code unit are the
same thing again.

 However, there is nothing wrong with providing a library function that
 takes a string and returns an iterator that iterates over code points,
 joining surrogate pairs as needed. You could even have one that
 iterates over characters (I think Tom calls them graphemes), if that
 is well-defined and useful.

Character can sometimes be a confusing term when it means something
different to us programmers than it does to users.  Code point to mean the
integer is a lot clearer to us but to no one else.  At work I often just
give in and go along with the crowd and say character for the number that
sits in a char or wchar_t or Character variable, even though of course
that's a code point.  I only rebel when they start calling code units 
characters, which (inexperienced) Java people tend to do, because that
leads to surrogate splitting and related errors.

By grapheme I mean something the user perceives as a single character.  In
full Unicodese, this is an extended grapheme cluster.  These are code point
sequences that start with a grapheme base and have zero or more grapheme
extenders following it.  For our purposes, that's *mostly* like saying you
have a non-Mark followed by any number of Mark code points, the main
exception being that a CR followed by a LF also counts as a single grapheme
in Unicode.

If you are in an editor and wanted to swap two characters, the one 
under the user's cursor and the one next to it, you have to deal with
graphemes not individual code points, or else you'd get the wrong answer.
Imagine swapping the last two characters of the first string below,
or the first two characters of second one:

contrôlée    contro\x{302}le\x{301}e
élève        e\x{301}le\x{300}ve
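
Here's a small sketch of that swap using the third-party regex module's
\X escape for extended grapheme clusters (the module and its \X support
are the only assumptions here):

    # Swap the last two user-perceived characters.  Splitting on \X keeps
    # each base character together with its combining marks; swapping raw
    # code points instead would strand the accents on the wrong letters.
    import regex   # third-party: pip install regex

    def swap_last_two_graphemes(s):
        g = regex.findall(r'\X', s)
        g[-2], g[-1] = g[-1], g[-2]
        return ''.join(g)

    word = 'contro\u0302le\u0301e'        # contrôlée, fully decomposed
    print(swap_last_two_graphemes(word))  # contrôleé: the acute travels with its e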

While you can sometimes fake a correct answer by considering things
in NFC not NFD, that doesn't work in the general case, as there
are only a few compatibility glyphs for round-tripping for legacy
encodings (like ISO 8859-1) compared with infinitely many combinations
of combining marks.  Particularly in mathematics and in phonetics, 
you often end up using marks on characters for which no pre-combined
variant glyph exists.  Here's the IPA for a couple of Spanish words
with their tight (phonetic, not phonemic) transcriptions:

anécdota    [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
rincón  [rĩŋˈkõ̞n]

NFD:
ane\x{301}cdota
[a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
rinco\x{301}n  [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]

NFC:
an\x{E9}cdota
[a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
rinc\x{F3}n  [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

So combining marks don't just go away in NFC, and you really do have to
deal with them.  Notice that to get the tabs right (your favorite subject :),
you have to deal with print widths, which is another place that you get
into trouble if you only count code points.
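
As a rough sketch of what that width reckoning involves (mine, and
deliberately simplistic: it skips combining marks and counts East Asian
Wide/Fullwidth as two columns, ignoring ambiguous widths, control
characters, and the other hard cases):

    # Approximate print columns: 0 for combining marks, 2 for East Asian
    # Wide/Fullwidth, 1 for everything else.
    import unicodedata

    def columns(s):
        width = 0
        for ch in s:
            if unicodedata.combining(ch):
                continue
            width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        return width

    print(columns('ane\u0301cdota'))   # 8, same as the precomposed 'anécdota'
    print(columns('斎藤'))             # 4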

BTW, did you know that the stress mark used in the phonetics above
is actually a (modifier) letter in Unicode, not punctuation?

# uniprops -a 2c8
U+02C8 ‹ˈ› \N{MODIFIER LETTER VERTICAL LINE}
\w \pL \p{L_} \p{Lm}
All Any Alnum Alpha Alphabetic Assigned InSpacingModifierLetters 
Case_Ignorable CI Common Zyyy Dia Diacritic L Lm Gr_Base Grapheme_Base Graph 
GrBase ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Print 
Spacing_Modifier_Letters Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum 
X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON 
Block=Spacing_Modifier_Letters Canonical_Combining_Class=0 
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR 
Script=Common Decomposition_Type=None DT=None East_Asian_Width=Neutral 
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX 
Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA 
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U 
Joining_Type=U Line_Break=BB Line_Break=Break_Before LB=BB Numeric_Type=None 
NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 
Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
Present_In=6.0 IN=6.0 SC=Zyyy

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Fri, 26 Aug 2011 21:11:24 -: 

 Would this also affect .islower() and friends?

SHORT VERSION:  (7 lines)

I don't believe so, but the relationship between lower() and islower()
is not as clear to me as I would have thought, and more importantly,
the code and the documentation for Python's islower() etc currently seem
to disagree.  For future releases, I recommend fixing the code, but if
compatibility is an issue, then perhaps for previous releases still in
maintenance mode fixing only the documentation would possibly be good
enough--your call.

===

MEDIUM VERSION: (87 lines)

I was initially confused with Python's islower() family because of the way
they are defined to operate on full strings.  They don't check that
everything is lowercase even though they say they do.

   
http://docs.python.org/py3k/library/stdtypes.html#sequence-types-str-bytes-bytearray-list-tuple-range

str.lower()

Return a copy of the string with all the cased characters [4]
converted to lowercase.

str.islower()

Return true if all cased characters [4] in the string are lowercase 
and there is at least one cased character, false otherwise.

[4] (1, 2, 3, 4) Cased characters are those with general category
property being one of “Lu” (Letter, uppercase), “Ll” (Letter,
lowercase), or “Lt” (Letter, titlecase).

This is strange in several ways.  Of lesser importance is that
strings can be considered lowercase even if they don't match

^\p{lowercase}+$

Another is that the result of calling str.lower() may not be .islower().
I'm not sure what these are particularly for, since I myself would just use
a regex to get finer-grained control.  (I suppose that because re doesn't
give access to the Unicode properties needed, this approach never
gained any traction in the Python community.)

However, the worst of this is that the documentation defines both cased
characters and lowercase characters *differently* from how Unicode
defines those very same terms.  This was quite confusing.

Unicode distinguishes Cased code points from Cased_*Letter* code points.
Python is using the Cased_Letter property but calling it Cased.  Cased is
a proper superset of Cased_Letter.  From the DerivedCoreProperties file in
the Unicode Character Database:

# Derived Property:   Cased (Cased)
#  As defined by Unicode Standard Definition D120
#  C has the Lowercase or Uppercase property or has a General_Category 
value of Titlecase_Letter.

In the same way, the Lowercase and Uppercase properties are not the same as
the Lowercase_*Letter* and Uppercase_*Letter* properties.  Rather, the former
are respectively proper supersets of the latter.  

# Derived Property: Lowercase
#  Generated from: Ll + Other_Lowercase

[...]

# Derived Property: Uppercase
#  Generated from: Lu + Other_Uppercase

In all these, you almost always want the superset versions not the
restricted subset versions you are using.  If it were in the regex engine,
the user could select either.

Java used to miss all these, too.  But in 1.7, they updated their character
methods to use the properties that they'd all along said they were using:

   
http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)

public static boolean isLowerCase(char ch)
Determines if the specified character is a lowercase character. 

 A character is lowercase if its general category type, provided by
 Character.getType(ch), is LOWERCASE_LETTER, or it has contributory
-   property Other_Lowercase as defined by the Unicode Standard.

Note: This method cannot handle supplementary characters.  To
  support all Unicode characters, including supplementary
  characters, use the isLowerCase(int) method.

(And yes, that's where Java uses character to mean code unit 
 not code point, alas.  No wonder people get confused)

I'm pretty sure that Python needs to either update its documentation to
match its code, update its code to match its documentation, or both.  Java
chose to update the code to match the documentation, and this is the course
I would recommend if at all possible.  If you say you are checking for
cased code points, then you should use the Unicode definition of cased code
points not your own, and if you say you are checking for lowercase code
points, then you should use the Unicode definition not your own.  Both of
these require access to contributory properties from the UCD and not 
just general categories alone.

--tom

===

LONG VERSION: (222 lines)

Essential tools I use for inspecting Unicode code points and their 
properties include

http://training.perl.com/scripts

[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-27 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Sat, 27 Aug 2011 16:15:33 -: 

 Although personally I don't have much of an intuition for what
 titlecase means (and why it's important), perhaps because I'm not
 familiar with any language where there is a third case for some
 letters.

Neither am I.  Even in old-style English with ae and oe, one wrote
ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
*Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

(BTW, in French you really shouldn't split up the œ into oe, 
  nor in Old English, Old Norse, or Icelandic the æ in ae;
  although in contemporary English, it's usually ok to do so.)

I believe that almost but not quite all the sticky situations with
Unicode casing involve compatibility characters for clean round-trips
with legacy encodings.  Exceptions include the German sharp s (both of 
them now) and the two Greek lowercase sigmas.  Thank goodness we don't
use the long s in English anymore.  What is it with s's, anyway? :)

Most of the titlecase letters are in Greek, with a few in Armenian.
I know no Armenian (their letters all look the same to me :), and the
folks I talked to about the Greek are skeptical.  The German sharp s is
a red herring, because you can never have it as the first letter
(although it needn't be the last, as in Rußland).  That's no more
possible than having the old legacy ff ligature appear at the beginning
of an English word.

In any event, there are only 129 total code points that are
problematic in terms of their case, where by problematic 
I mean one or more of:

   --- titlecase differs from uppercase
   --- foldcase  differs from lowercase
   --- any of fold/lower/title/uppercase yields more than one code point
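
Here's a rough sketch of how you could hunt for code points like that,
assuming str.upper/lower/title/casefold do full Unicode casemapping
(casefold() is exactly the kind of addition being requested here); the
count you get depends on which Unicode version your interpreter ships:

    # Flag code points where titlecase != uppercase, casefold != lowercase,
    # or any of the four mappings yields more than one code point.
    import sys
    import unicodedata

    problematic = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in ('Cn', 'Cs'):   # skip unassigned, surrogates
            continue
        fc, lc, tc, uc = ch.casefold(), ch.lower(), ch.title(), ch.upper()
        if tc != uc or fc != lc or any(len(m) > 1 for m in (fc, lc, tc, uc)):
            problematic.append(cp)

    print(len(problematic))   # on the order of a hundred-plus, Unicode-version dependent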

Of all these, it's the (now two!) sharp s's and the Turkic i that are the most
annoying.  It's really quite a lot of trouble to go through for so few code
points of so little (perceived) use.  But I suppose you never know what new
ones they'll uncover, either.  Here are those 129 case-problematicals arranged
in UCA order.  Some of these have normalization forms that decompose into
graphemes with four code points (not shown).  There are a few other oddities,
like the Kelvin sign and other singletons, but these are most of the trouble.
They're all in the BMP; I guess we learned our lesson. :)

--tom

  1: U+0345 ○ͅ  COMBINING  GREEK YPOGEGRAMMENI
   fc=ι  U+3B9 lc=○ͅ  U+345 tc=Ι  U+399 uc=Ι  U+399 
  2: U+1E9A ẚ  LATIN SMALL LETTER A WITH RIGHT HALF RING
   fc=aʾ  U+61.2BE lc=ẚ  U+1E9A tc=Aʾ  U+41.2BE uc=Aʾ  U+41.2BE 
  3: U+01F3 dz  LATIN SMALL LETTER DZ
   fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  4: U+01F2 Dz  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
   fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  5: U+01F1 DZ  LATIN CAPITAL LETTER DZ
   fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  6: U+01C6 dž  LATIN SMALL LETTER DZ WITH CARON
   fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  7: U+01C5 Dž  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
   fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  8: U+01C4 DŽ  LATIN CAPITAL LETTER DZ WITH CARON
   fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  9: U+FB00 ff  LATIN SMALL LIGATURE FF
   fc=ff  U+66.66 lc=ff  U+FB00 tc=Ff  U+46.66 uc=FF  U+46.46 
 10: U+FB03 ffi  LATIN SMALL LIGATURE FFI
   fc=ffi  U+66.66.69 lc=ffi  U+FB03 tc=Ffi  U+46.66.69 uc=FFI  
U+46.46.49 
 11: U+FB04 ffl  LATIN SMALL LIGATURE FFL
   fc=ffl  U+66.66.6C lc=ffl  U+FB04 tc=Ffl  U+46.66.6C uc=FFL  
U+46.46.4C 
 12: U+FB01 fi  LATIN SMALL LIGATURE FI
   fc=fi  U+66.69 lc=fi  U+FB01 tc=Fi  U+46.69 uc=FI  U+46.49 
 13: U+FB02 fl  LATIN SMALL LIGATURE FL
   fc=fl  U+66.6C lc=fl  U+FB02 tc=Fl  U+46.6C uc=FL  U+46.4C 
 14: U+1E96 ẖ  LATIN SMALL LETTER H WITH LINE BELOW
   fc=ẖ  U+68.331 lc=ẖ  U+1E96 tc=H̱  U+48.331 uc=H̱  U+48.331 
 15: U+0130 İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
   fc=i̇  U+69.307 lc=i̇  U+69.307 tc=İ  U+130 uc=İ  U+130 
 16: U+01F0 ǰ  LATIN SMALL LETTER J WITH CARON
   fc=ǰ  U+6A.30C lc=ǰ  U+1F0 tc=J̌  U+4A.30C uc=J̌  U+4A.30C 
 17: U+01C9 lj  LATIN SMALL LETTER LJ
   fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 18: U+01C8 Lj  LATIN CAPITAL LETTER L WITH SMALL LETTER J
   fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 19: U+01C7 LJ  LATIN CAPITAL LETTER LJ
   fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 20: U+01CC nj  LATIN SMALL LETTER NJ
   fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 21: U+01CB Nj  LATIN CAPITAL LETTER N WITH SMALL LETTER J
   fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 22: U+01CA NJ  LATIN CAPITAL LETTER NJ
   fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U

[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Sounds like a fair feature request for Python 3.3, as long as the
 intention is that users must import some module from the standard
 library and use functions defined in that module.  The operations and
 methods defined for str instances (e.g. ==, , etc.) should not change
 their behavior.

 Is there an existing 3rd party library that we could adopt (even if it isn't 
 perfect yet)?

I *think* you could use ICU's.  

I'm pretty sure the Parrot people use ICU libraries.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12735
___



[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Fri, 26 Aug 2011 21:16:57 -: 

 Yeah, this should be fixed in 3.3 and probably backported to 3.2
 and 2.7.  (There is already no guarantee that len(s) ==
 len(s.title()), right?)

Well, *I* don't know of any such guarantee, 
but I don't know Python very well.

In general, Unicode makes very few guarantees about casing.  Under full
casemapping, which is the only way to do the silly Turkish stuff amongst
quite a bit else, any of the three casemappings can change the length of
the string.

Other things you can't rely on are round tripping and single paths.  By
roundtripping, just look at the two lowercase sigmas and think about how
you can't get back to one of them if you uppercase them both.  By single
paths, I mean that code that does some sort of conversion where it first
lowercases everything and then titlecases the first letter can produce
something different from titlecasing just the original first letter and
then lowercasing the rest of them.  That's because tc(x) and tc(lc(x)) can
be different.

--tom

--
title: str.title()  is overzealous by upcasing combining marks inappropriately 
- str.title() is overzealous by upcasing combining marks inappropriately

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12737
___



[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Raymond Hettinger raymond.hettin...@gmail.com added the comment:

 I would like to be involved in the design of the API for a UCA module
 and its routines for loading Unicode Collation Element Tables (not
 making the mistake of using global state like the locale module does).

Is this the problem where a locale is global to a process (or thread)?

The way I'm used to using the UCA module in Perl, that's never a problem,
because it's completely object-oriented.  There's no global state.  You 
instantiate a collator object with all the state it needs, like

collation_level
upper_before_lower
backwards_levels
normalization
override_CJK
override_Hangul
katakana_before_hiragana
variable
locale
preprocess

And then you use that object for all your collation needs, including
not just sorting but also string comparison and even searches.

For example, you could instantiate a first collator object with its level
set to one, meaning just compare base alphanumerics not diacritics or case
or nonletters, and a second with the defaults so that it uses all four
levels or a different normalization.  I have on occasion had more than one
collator object around at once each with its own locale, like if I want to
compare different locales' comparisons.
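
To give a concrete feel for the shape I mean, here is a purely
hypothetical sketch; the module, class, and parameter names below don't
exist anywhere, they just mirror the options listed above:

    # Hypothetical API sketch only -- "ucollate" is not a real module.
    from ucollate import Collator

    base_only = Collator(collation_level=1, normalization='NFD')
    french    = Collator(locale='fr', backwards_levels=[2])

    words = ['cote', 'coté', 'côte', 'côté']
    print(sorted(words, key=french.sort_key))   # accents ordered the French way
    print(base_only.eq('resume', 'résumé'))     # True: level 1 ignores accents and case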

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12735
___



[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I should probably mention the importance in the design of a UCA module of
being able to specify which UCA version number you want it to behave like
in case you plan to override some of the DUCET entries.  That way if you
run under a later UCA with different DUCET weights, your own tailorings will
still make sense.  If you don't do this, your collation tailorings can break 
in a new release of the UCA.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12735
___



[issue12735] request full Unicode collation support in std python library

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Fri, 26 Aug 2011 21:55:03 -: 

 I know I sound like NIH, but I'm always reluctant to add a big 3rd
 party lib like ICU to the permanent dependencies of all future Python
 distros.  If people want to use ICU they already can.  OTOH I don't
 have a better idea. :-(

I know exactly what you mean.  I would not want to push that on anyone,
being dependent on a gigantic 3rd-party module.  I just tried to answer
the question.  The only two full UCA implementations I know of are ICU's
and Perl's, which does not use ICU (since we're UTF-8, etc).

I just wish Python had Unicode collation, is all.

--tom

PS: (I haven't had good luck with the ICU bindings in 3.2.)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12735
___



[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Guido van Rossum rep...@bugs.python.org wrote
   on Fri, 26 Aug 2011 21:11:24 -: 

 Guido van Rossum gu...@python.org added the comment:

 I presume this applies to builtin str methods like .lower(), right?  I
 think it is a good thing to do for Python 3.3.

Yes, the full casemaps are for upper, title, and lowercase.  There is 
also a full casefold and turkic case fold (which is full), but you
don't have a casefold function so I guess that doesn't matter.

 We'd need to define what should happen in edge cases, e.g. when
 (against all odds) a string happens to contain a lone surrogate or
 some other code point or sequence of code points that the Unicode
 standard considers illegal.  I think it should not fail but just leave
 those code points alone.

Well, it's a funny thing.  There are properties given for all
Unicode code points, even noncharacter code points.  This
includes the casing properties, oddly enough.

From UnicodeData.txt, which has a few surrogate entries; notice
no casing is given:

D800;<Non Private Use High Surrogate, First>;Cs;0;L;N;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;N;
DB80;<Private Use High Surrogate, First>;Cs;0;L;N;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;N;
DC00;<Low Surrogate, First>;Cs;0;L;N;
DFFF;<Low Surrogate, Last>;Cs;0;L;N;

And in SpecialCasing.txt, which does not have surrogates but does have
a default clause:

# This file is a supplement to the UnicodeData file.
# It contains additional information about the casing of Unicode characters.
# (For compatibility, the UnicodeData.txt file only contains case mappings 
for
# characters where they are 1-1, and independent of context and language.
# For more information, see the discussion of Case Mappings in the Unicode 
Standard.
#
# All code points not listed in this file that do not have a simple case 
mappings
# in UnicodeData.txt map to themselves.

And in CaseFolding.txt, which also does not have surrogates but again does 
have a default clause:

# The data supports both implementations that require simple case foldings
# (where string lengths don't change), and implementations that allow full 
case folding
# (where string lengths may grow). Note that where they can be supported, 
the
# full case foldings are superior: for example, they allow MASSE and 
Maße to match.
#
# All code points not listed in this file map to themselves.

Taken all together, it follows that the surrogates have case{map,fold}s
back to themselves, since they have no case{map,fold}s listed.

It's ok to have arbitrary code points in memory, including surrogates and
the 66 noncharacters.  It just isn't legal to have them in a UTF stream
for open interchange, whatever that means.  
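
A tiny illustration of that split, as I understand the current behavior
(details may vary across builds):

    # A lone surrogate is a perfectly good in-memory code point and
    # case-maps to itself, but a strict UTF encoder refuses to emit it.
    s = '\ud800'
    print(s.lower() == s, s.upper() == s)        # True True
    try:
        s.encode('utf-8')
    except UnicodeEncodeError as e:
        print('not encodable for interchange:', e.reason)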

 Does this require us to import more data files from the Unicode
 standard?  By itself that doesn't scare me.

One way or the other, yes, notably the SpecialCasing file for
casemapping and the CaseFolding file for casefolding (which you
should do anyway to fix re.I).  But you can and should process the
new files into some tighter format optimized for your own lookups.

Oddly, Java doesn't provide String methods that do full casing on
titlecase, even though they do so on lowercase and uppercase.  On
titlecase they only expose the simple casemaps via the Character class,
which are the ones from UnicodeData.  They recognize that this is a flaw, 
but it was too late to fix it for Java 7.

 Would this also affect .islower() and friends?

Well, it shouldn't, but .islower() and friends are already mistaken.
They seem to be checking for GC=Ll and such, but they need to be
checking the Unicode binary property Lowercase and such.  Watch:

test 37 for string Ⅷ
wanted ⅷ to be lowercase of Ⅷ but python disagrees
wanted Ⅷ to be titlecase of Ⅷ but python disagrees
wanted Ⅷ to be uppercase of Ⅷ but python disagrees
test 37 failed 3 subtests

test 39 for string Ⓚ
wanted ⓚ to be lowercase of Ⓚ but python disagrees
wanted Ⓚ to be titlecase of Ⓚ but python disagrees
wanted Ⓚ to be uppercase of Ⓚ but python disagrees
test 39 failed 3 subtests

That's because the Roman numerals are GC=Nl but still have
case and change case.  Similarly for the circled letters which
are GC=So but have case and change case.  Plus there's U+0345,
the iota subscript, which is GC=Mn but has case and changes case.
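
Here's a quick way to see the mismatch, using the third-party regex
module for the binary Lowercase property (assuming its \p{Lowercase}
support); I make no claim about what a particular build's islower()
will print:

    # Cased-but-not-Letter code points: small roman numeral eight, circled k,
    # and the combining ypogegrammeni.
    import regex   # third-party: pip install regex

    for ch in ('\u2177', '\u24DA', '\u0345'):
        has_prop = bool(regex.match(r'\p{Lowercase}', ch))
        print('U+%04X islower()=%-5s Lowercase=%s'
              % (ord(ch), ch.islower(), has_prop))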

I don't remember whether I've sent in my full test suite or not.  
If I haven't yet, I should attach it to the bug report.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12736
___



[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-26 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Here’s my casing test suite; I thought I sent it in but the mux file here isn’t 
the full thing.

 It does several things, including letting you run it with regex vs re.  It 
also checks for the islower, etc functions. It has both simple and full (and 
turkic) maps and folds in it, but is configured to only check the simple 
versions for now.  The islower and isupper etc functions seem to be checking 
the wrong Unicode property.

Yes, it has my quaint Unixisms in it, because it needs to run with UTF-8 
output, or you can't read what's going on.

--
Added file: http://bugs.python.org/file23051/casing-tests.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12736
___



[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy rep...@bugs.python.org wrote
   on Fri, 19 Aug 2011 22:50:58 -: 

 My current opinion is that adding the aliases might be done in current
 releases. It certainly would serve the any user who does not know to
 misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Yes, I think the 11 aliases pose no problem.  It's amazing the trouble
you get into from having a fat-fingered amanuensis typing your laws 
into indelible stone tablets.

 Adding named sequences is definitely a feature request. The definition
 of .lookup(name) would be enlarged to Look up character by name,
 alias, or named sequence with reference to the specific files. The
 meaning of \N{} would also have to be enlarged.

But these do.  The problem is bracketed character classes.  
Yes, if you got named reference into the regex compiler as a raw
string, it could in theory rewrite

[abc\N{seq}] 

as 

(?:[abc]|\N{seq})

but that doesn't help if the sequence got replaced as a string escape.
At which point you have different behavior in the two lookalike cases.

If you ask how we do this in Perl, the answer is poorly.  It really only
works well in strings, not charclasses, although there is a proposal to do
a rewrite during compilation like I've spelled out above.  Seems messy for
something that might(?) not get much use.  But it would be nice for \N{} to
work to access the whole namespace without prejudice.  I have a feeling
this may be a case of trying to keep one's cake and eating it too, as
the two goals seem to rule each other out.

 If you look at the ICU UCharacter class, you can see that they provide a 
 more

 More what ;-)

More expressive set of lookup functions where it is clear which thing
you are getting.  I believe the ICU regexes only support one-char returns
for \N{...}, not multis per the sequences.  But I may not be looking
at the right docs for ICU; not sure.

 I presume ICU =International Components for Unicode, icu-project.org/
 Offers a portable set of C/C++ and Java libraries for Unicode support,
 software internationalization (I18N) and globalization (G11N). [appears
 to be free, open source, and possibly usable within Python]

Well, there are some Python bindings for ICU that I was eager to try out,
because I wanted to see whether I couild get at full/real Unicode collation
that way, but I had trouble getting the Python bindings to compile.  Not
sure why.  The documentation for the Python bindings isn't very um wordy,
and it isn't clear how tightly integrated it all is: there's talk about C++
strings that kind of scares me. :)

Hm, and maybe they are only for Python 2 not Python 3, which I try to do
all my Python stuff in because it seems like it has a better Unicode model.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12753
___



[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-19 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Matthew Barnett rep...@bugs.python.org wrote
   on Fri, 19 Aug 2011 23:36:45 -: 

 For the Line_Break property, one of the possible values is
 Inseparable, with 2 permitted aliases, the shorter IN (which 
 is reasonable) and Inseperable (ouch!).

Yeah, I've shaken my head at that one, too.

It's one thing to make an alias for something you typo'd in the first 
place, but to have something that's correct which you then make a typo 
alias for is just encouraging bad/sloppy/wrong behavior.

Bidi_Class=Paragraph_Separator
Bidi_Class=Common_Separator
Bidi_Class=European_Separator
Bidi_Class=Segment_Separator
General_Category=Line_Separator
General_Category=Paragraph_Separator
General_Category=Separator
General_Category=Space_Separator
Line_Break=Inseparable
Line_Break=Inseperable

And then there's this set, which makes you wonder
why they couldn't spell at least *one* of them out:

Sentence_Break=Sep SB=SE
Sentence_Break=Sp  SB=Sp

You really have to look those up to realize they're two different things:

SB ; SE; Sep
SB ; SP; Sp

And that none of them have something like SB=Space or SB=Separator
so you know what you're talking about.  Grrr.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12753
___



[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

I think the 4 macros:
 #define _Py_UNICODE_ISSURROGATE
 #define _Py_UNICODE_ISHIGHSURROGATE
 #define _Py_UNICODE_ISLOWSURROGATE
 #define _Py_UNICODE_JOIN_SURROGATES
are quite straightforward and can avoid using the trailing _.

For what it's worth, I've seen Unicode documentation that prefers the
terms lead surrogate and trail surrogate as being clearer than the
terms high surrogate and low surrogate.

For example, from the Unicode BOM FAQ at http://unicode.org/faq/utf_bom.html

Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values, 
reserved for use as the leading, and
   trailing values of paired code units in UTF-16. Leading, also called 
high, surrogates are from D800₁₆ to DBFF₁₆,
   and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are 
called surrogates, since they do not
   represent characters directly, but only as a pair.

BTW, considering recent discussions, you might want to read:

Q: Are there any 16-bit values that are invalid?

A: The two values FFFE₁₆ and FFFF₁₆ as well as the 32 values from FDD0₁₆ to 
FDEF₁₆ represent noncharacters. They are
   invalid in interchange, but may be freely used internal to an 
implementation. Unpaired surrogates are invalid as
   well, i.e. any value in the range D800₁₆ to DBFF₁₆ not followed by a 
value in the range DC00₁₆ to DFFF₁₆, or any
   value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range 
D800₁₆ to DBFF₁₆. [AF]

and also the answer to:

Q: Are there any paired surrogates that are invalid?

whose answer I here omit for brevity, as it is a table.

I suspect that you guys are now increasingly sold on the answer to the next FAQ 
right after that one, now. :)

Q: Because supplementary characters are uncommon, does that mean I can 
ignore them?

A: Just because supplementary characters (expressed with surrogate pairs in 
UTF-16) are uncommon does 
   not mean that they should be neglected. They include:

* emoji symbols and emoticons, for interoperating with Japanese mobile 
phones
* uncommon (but not unused) CJK characters, important for personal and 
place names
* variation selectors for ideographic variation sequences
* important symbols for mathematics
* numerous minority scripts and historic scripts, important for some 
user communities

Another example of using lead and trail surrogates is in the first
sentence from http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html

* Naming: For clarity, High and Low surrogates are called Lead and Trail in 
the API, which gives a better sense of
  their ordering in a string. offset16 and offset32 are used to distinguish 
offsets to UTF-16 boundaries vs offsets
  to UTF-32 boundaries. int char32 is used to contain UTF-32 characters, as 
opposed to char16, which is a UTF-16
  code unit.
* Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a 
UTF-16 offset and back. Because of the
  difference in structure, you can roundtrip from a UTF-16 offset to a 
UTF-32 offset and back if and only if
  bounds(string, offset16) != TRAIL.
* Exceptions: The error checking will throw an exception if indices are out 
of bounds. Other than than that, all
  methods will behave reasonably, even if unmatched surrogates or 
out-of-bounds UTF-32 values are present.
  UCharacter.isLegal() can be used to check for validity if desired.
* Unmatched Surrogates: If the string contains unmatched surrogates, then 
these are counted as one UTF-32 value.
  This matches their iteration behavior, which is vital. It also matches 
common display practice as missing glyphs
  (see the Unicode Standard Section 5.4, 5.5).
* Optimization: The method implementations may need optimization if the 
compiler doesn't fold static final methods.
  Since surrogate pairs will form an exceeding small percentage of all the 
text in the world, the singleton case
  should always be optimized for.

You can also see this reflected in the utf.h file from the ICU project as part 
of their C API in ICU4C:

#define U_SENTINEL   (-1)
This value is intended for sentinel values for APIs that (take or) 
return single code points (UChar32). 
#define U_IS_UNICODE_NONCHAR(c)
Is this code point a Unicode noncharacter? 
#define U_IS_UNICODE_CHAR(c)
Is c a Unicode code point value (0..U+10FFFF) that can be assigned 
a character? 
#define U_IS_BMP(c)   ((uint32_t)(c)<=0xffff)
Is this code point a BMP code point (U+0000..U+FFFF)? 
#define U_IS_SUPPLEMENTARY(c)   ((uint32_t)((c)-0x10000)<=0xfffff)
Is this code point a supplementary code point (U+10000..U+10FFFF)? 
#define U_IS_LEAD(c)   (((c

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I now see there are lots of good things in the BOM FAQ that have come up
lately regarding surrogates and other illegal characters, and about what
can go in data streams.  

I quote a few of these from http://unicode.org/faq/utf_bom.html below:

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? 

A: A different issue arises if an unpaired surrogate is encountered when 
converting ill-formed UTF-16 data. 
   By representing such an *unpaired* surrogate on its own as a 3-byte 
sequence, the resulting UTF-8 data stream
   would become ill-formed. While it faithfully reflects the nature of the 
input, Unicode conformance requires
   that encoding form conversion always results in valid data stream. 
Therefore a converter *must* treat this
   as an error.

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32? 

A: If an unpaired surrogate is encountered when converting ill-formed 
UTF-16 data, any conformant converter must
   treat this as an error. By representing such an unpaired surrogate on 
its own, the resulting UTF-32 data stream
   would become ill-formed. While it faithfully reflects the nature of the 
input, Unicode conformance requires that
   encoding form conversion always results in valid data stream.

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If 
yes, then can I still assume the remaining
   UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the 
endianness of the byte stream. UTF-8
   always has the same byte order. An initial BOM is only used as a 
signature — an indication that an otherwise
   unmarked text file is in UTF-8. Note that some recipients of UTF-8 
encoded data do not expect a BOM. Where UTF-8
   is used transparently in 8-bit environments, the use of a BOM will 
interfere with any protocol or file format
   that expects specific ASCII characters at the beginning, such as the use 
of #! of at the beginning of Unix
   shell scripts.

Q: What should I do with U+FEFF in the middle of a file?

A: In the absence of a protocol supporting its use as a BOM and when not at 
the beginning of a text stream, U+FEFF
   should normally not occur. For backwards compatibility it should be 
treated as ZERO WIDTH NON-BREAKING SPACE
   (ZWNBSP), and is then part of the content of the file or string. The use 
of U+2060 WORD JOINER is strongly
   preferred over ZWNBSP for expressing word joining semantics since it 
cannot be confused with a BOM. When
   designing a markup language or data protocol, the use of U+FEFF can be 
restricted to that of Byte Order Mark. In
   that case, any U+FEFF occurring in the middle of a file can be treated 
as an unsupported character.

Q: How do I tag data that does not interpret U+FEFF as a BOM?

A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to 
indicate little-endian UTF-16 text. 
   If you do use a BOM, tag the text as simply UTF-16. 

Q: Why wouldn’t I always use a protocol that requires a BOM?

A: Where the data has an associated type, such as a field in a database, a 
BOM is unnecessary. In particular, 
   if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or 
UTF-32LE, a BOM is neither necessary *nor
   permitted*. Any U+FEFF would be interpreted as a ZWNBSP.  Do not tag 
every string in a database or set of fields
   with a BOM, since it wastes space and complicates string concatenation. 
Moreover, it also means two data fields
   may have precisely the same content, but not be binary-equal (where one 
is prefaced by a BOM).

Somewhat frustratingly, I am now almost more confused than ever by the last two 
sentences here:

Q: What is a UTF?

A: A Unicode transformation format (UTF) is an algorithmic mapping from 
every Unicode code point (except surrogate
   code points) to a unique byte sequence. The ISO/IEC 10646 standard uses 
the term “UCS transformation format” for
   UTF; the two terms are merely synonyms for the same concept.

   Each UTF is reversible, thus every UTF supports *lossless round 
tripping*: mapping from any Unicode coded
   character sequence S to a sequence of bytes and back will produce S 
again. To ensure round tripping, a UTF
   mapping *must also* map all code points that are not valid Unicode 
characters to unique byte sequences. These
   invalid code points are the 66 *noncharacters* (including FFFE and FFFF), 
as well as unpaired surrogates.

My confusion is about the invalid code points. The first two FAQs I cite at the 
top are quite clear that it is illegal
to have unpaired surrogates in a UTF stream.  I don’t understand therefore what 
it is saying about “must also” mapping all
code points that aren’t valid Unicode characters to “unique byte sequences” to 
ensure

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote
   on Tue, 16 Aug 2011 09:18:46 -: 

 I think the 4 macros:
  #define _Py_UNICODE_ISSURROGATE
  #define _Py_UNICODE_ISHIGHSURROGATE
  #define _Py_UNICODE_ISLOWSURROGATE
  #define _Py_UNICODE_JOIN_SURROGATES
 are quite straightforward and can avoid using the trailing _.

 I don't want to bikeshed, but can we have proper consistent word separation?
 _Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE
 (etc.)

Oh good, I thought it was only me whohadtroublereadingthose. :)

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10542
___



[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Tue, 16 Aug 2011 09:23:50 -: 

 All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER
 and Py_UNICODE_TOLOWER.  I agree that keeping the words separate makes them
 more readable though.

   [0]: Include/unicodeobject.h:328

I am guessing that that is not quite why those don't have underscores
in them.  I bet it is actually something else.  Watch:

% unigrep '^\s*#\s*define\s+Py_[\p{Lu}_]+\b' unicodeobject.h
#define Py_UNICODEOBJECT_H
#define Py_USING_UNICODE
#define Py_UNICODE_WIDE
#define Py_UNICODE_ISSPACE(ch) \
#define Py_UNICODE_ISLOWER(ch) _PyUnicode_IsLowercase(ch)
#define Py_UNICODE_ISUPPER(ch) _PyUnicode_IsUppercase(ch)
#define Py_UNICODE_ISTITLE(ch) _PyUnicode_IsTitlecase(ch)
#define Py_UNICODE_ISLINEBREAK(ch) _PyUnicode_IsLinebreak(ch)
#define Py_UNICODE_TOLOWER(ch) _PyUnicode_ToLowercase(ch)
#define Py_UNICODE_TOUPPER(ch) _PyUnicode_ToUppercase(ch)
#define Py_UNICODE_TOTITLE(ch) _PyUnicode_ToTitlecase(ch)
#define Py_UNICODE_ISDECIMAL(ch) _PyUnicode_IsDecimalDigit(ch)
#define Py_UNICODE_ISDIGIT(ch) _PyUnicode_IsDigit(ch)
#define Py_UNICODE_ISNUMERIC(ch) _PyUnicode_IsNumeric(ch)
#define Py_UNICODE_ISPRINTABLE(ch) _PyUnicode_IsPrintable(ch)
#define Py_UNICODE_TODECIMAL(ch) _PyUnicode_ToDecimalDigit(ch)
#define Py_UNICODE_TODIGIT(ch) _PyUnicode_ToDigit(ch)
#define Py_UNICODE_TONUMERIC(ch) _PyUnicode_ToNumeric(ch)
#define Py_UNICODE_ISALPHA(ch) _PyUnicode_IsAlpha(ch)
#define Py_UNICODE_ISALNUM(ch) \
#define Py_UNICODE_COPY(target, source, length) \
#define Py_UNICODE_FILL(target, value, length) \
#define Py_UNICODE_MATCH(string, offset, substring) \
#define Py_UNICODE_REPLACEMENT_CHARACTER ((Py_UNICODE) 0xFFFD)

It looks like what is actually happening there is that you started out
with names of the normal ctype(3) macroish thingies:

 isalpha isupper islower isdigit isxdigit isalnum isspace ispunct
 isprint isgraph iscntrl isblank isascii toupper tolower toascii

and wanted to preserve those, which would lead to Py_UNICODE_TOLOWER and
Py_UNICODE_TOUPPER, since there are no functions in the original C versions
those seem to mirror.  Then when you wanted more of that ilk, you sensibly
kept to the same naming convention.

I eyeball few exceptions to that style here:

% perl -nle '/^\s*#\s*define\s+(Py_[\p{Lu}_]+)\b/ and print $1' Include/*.h 
| sort -dfu | fmt -150
Py_ABSTRACTOBJECT_H Py_ALIGNED Py_ALLOW_RECURSION Py_ARITHMETIC_RIGHT_SHIFT 
Py_ASDL_H Py_AST_H Py_ATOMIC_H Py_BEGIN_ALLOW_THREADS Py_BITSET_H
Py_BLOCK_THREADS Py_BLTINMODULE_H Py_BOOLOBJECT_H Py_BYTEARRAYOBJECT_H 
Py_BYTES_CTYPE_H Py_BYTESOBJECT_H Py_CAPSULE_H Py_CELLOBJECT_H Py_CEVAL_H
Py_CHARMASK Py_CLASSOBJECT_H Py_CLEANUP_SUPPORTED Py_CLEAR 
Py_CODECREGISTRY_H Py_CODE_H Py_COMPILE_H Py_COMPLEXOBJECT_H Py_CURSES_H 
Py_DECREF
Py_DEPRECATED Py_DESCROBJECT_H Py_DICTOBJECT_H Py_DTSF_ALT Py_DTSF_SIGN 
Py_DTST_FINITE Py_DTST_INFINITE Py_DTST_NAN Py_END_ALLOW_RECURSION
Py_END_ALLOW_THREADS Py_ENUMOBJECT_H Py_EQ Py_ERRCODE_H Py_ERRORS_H 
Py_EVAL_H Py_FILEOBJECT_H Py_FILEUTILS_H Py_FLOATOBJECT_H Py_FORCE_DOUBLE
Py_FORCE_EXPANSION Py_FORMAT_PARSETUPLE Py_FRAMEOBJECT_H Py_FUNCOBJECT_H 
Py_GCC_ATTRIBUTE Py_GE Py_GENOBJECT_H Py_GETENV Py_GRAMMAR_H Py_GT
Py_HUGE_VAL Py_IMPORT_H Py_INCREF Py_INTRCHECK_H Py_INVALID_SIZE Py_ISALNUM 
Py_ISALPHA Py_ISDIGIT Py_IS_FINITE Py_IS_INFINITY Py_ISLOWER Py_IS_NAN
Py_ISSPACE Py_ISUPPER Py_ISXDIGIT Py_ITEROBJECT_H Py_LE Py_LISTOBJECT_H 
Py_LL Py_LOCAL Py_LOCAL_INLINE Py_LONGINTREPR_H Py_LONGOBJECT_H Py_LT
Py_MARSHAL_H Py_MARSHAL_VERSION Py_MATH_E Py_MATH_PI Py_MEMCPY 
Py_MEMORYOBJECT_H Py_METAGRAMMAR_H Py_METHODOBJECT_H Py_MODSUPPORT_H 
Py_MODULEOBJECT_H
Py_NAN Py_NE Py_NODE_H Py_OBJECT_H Py_OBJIMPL_H Py_OPCODE_H Py_OSDEFS_H 
Py_OVERFLOWED Py_PARSETOK_H Py_PGEN_H Py_PGENHEADERS_H Py_PRINT_RAW
Py_PYARENA_H Py_PYDEBUG_H Py_PYFPE_H Py_PYGETOPT_H Py_PYMATH_H Py_PYMEM_H 
Py_PYPORT_H Py_PYSTATE_H Py_PYTHON_H Py_PYTHONRUN_H Py_PYTHREAD_H
Py_PYTIME_H Py_RANGEOBJECT_H Py_REFCNT Py_REF_DEBUG Py_RETURN_FALSE 
Py_RETURN_INF Py_RETURN_NAN Py_RETURN_NONE Py_RETURN_TRUE Py_SAFE_DOWNCAST
Py_SET_ERANGE_IF_OVERFLOW Py_SET_ERRNO_ON_MATH_ERROR Py_SETOBJECT_H Py_SIZE 
Py_SLICEOBJECT_H Py_STRCMP_H Py_STRTOD_H Py_STRUCTMEMBER_H Py_STRUCTSEQ_H
Py_SYMTABLE_H Py_SYSMODULE_H Py_TOKEN_H Py_TOLOWER Py_TOUPPER 
Py_TPFLAGS_BASE_EXC_SUBCLASS Py_TPFLAGS_BASETYPE Py_TPFLAGS_BYTES_SUBCLASS
Py_TPFLAGS_DEFAULT Py_TPFLAGS_DICT_SUBCLASS Py_TPFLAGS_HAVE_GC 
Py_TPFLAGS_HAVE_STACKLESS_EXTENSION Py_TPFLAGS_HAVE_VERSION_TAG 
Py_TPFLAGS_HEAPTYPE
Py_TPFLAGS_INT_SUBCLASS Py_TPFLAGS_IS_ABSTRACT Py_TPFLAGS_LIST_SUBCLASS 
Py_TPFLAGS_LONG_SUBCLASS Py_TPFLAGS_READY Py_TPFLAGS_READYING

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Marc-Andre Lemburg rep...@bugs.python.org wrote
   on Tue, 16 Aug 2011 12:11:22 -: 

 The reasoning behind e.g. ISSURROGATE is that those names originate
 from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE
 macros which in return stem from the C APIs of the same names
 (see unicodeobject.h for reference).

I eventually figured that part out in the larger context.  
Makes sense looked at that way.

 Regarding low/high vs. lead/trail: The Unicode database uses the terms
 low/high and we do in Python as well, so let's stick with those.

Yes, those are their block assignments,  Block=High_Surrogates and 
Block=Low_Surrogates.
I just thought I should mention that in the time since those were invented 
(which cannot
be changed), after using them in real code for some years, their lingo seems to 
have 
evolved away from those initial names and toward lead/trail as less confusing.

 What I don't understand is why those macros should be declared
 private to Python (with the leading underscore). They are quite
 useful for extensions implementing codecs or other transformations
 as well.

I was wondering about that myself.  Beyond there being a lot fewer of those
private macros in the Python *.h files, they also seem to be of rather different
character than the iswhatever() macros:

$ perl -nle '/^\s*#\s*define\s+(_Py_[\p{Lu}_]+)\b/ and print $1' *.h | sort 
-dfu | fmt -160
_Py_ANNOTATE_BARRIER_DESTROY _Py_ANNOTATE_BARRIER_INIT 
_Py_ANNOTATE_BARRIER_WAIT_AFTER _Py_ANNOTATE_BARRIER_WAIT_BEFORE 
_Py_ANNOTATE_BENIGN_RACE
_Py_ANNOTATE_BENIGN_RACE_SIZED _Py_ANNOTATE_BENIGN_RACE_STATIC 
_Py_ANNOTATE_CONDVAR_LOCK_WAIT _Py_ANNOTATE_CONDVAR_SIGNAL 
_Py_ANNOTATE_CONDVAR_SIGNAL_ALL
_Py_ANNOTATE_CONDVAR_WAIT _Py_ANNOTATE_ENABLE_RACE_DETECTION 
_Py_ANNOTATE_EXPECT_RACE _Py_ANNOTATE_FLUSH_STATE _Py_ANNOTATE_HAPPENS_AFTER
_Py_ANNOTATE_HAPPENS_BEFORE _Py_ANNOTATE_IGNORE_READS_AND_WRITES_BEGIN 
_Py_ANNOTATE_IGNORE_READS_AND_WRITES_END _Py_ANNOTATE_IGNORE_READS_BEGIN
_Py_ANNOTATE_IGNORE_READS_END _Py_ANNOTATE_IGNORE_SYNC_BEGIN 
_Py_ANNOTATE_IGNORE_SYNC_END _Py_ANNOTATE_IGNORE_WRITES_BEGIN 
_Py_ANNOTATE_IGNORE_WRITES_END
_Py_ANNOTATE_MUTEX_IS_USED_AS_CONDVAR _Py_ANNOTATE_NEW_MEMORY 
_Py_ANNOTATE_NO_OP _Py_ANNOTATE_PCQ_CREATE _Py_ANNOTATE_PCQ_DESTROY 
_Py_ANNOTATE_PCQ_GET
_Py_ANNOTATE_PCQ_PUT _Py_ANNOTATE_PUBLISH_MEMORY_RANGE 
_Py_ANNOTATE_PURE_HAPPENS_BEFORE_MUTEX _Py_ANNOTATE_RWLOCK_ACQUIRED 
_Py_ANNOTATE_RWLOCK_CREATE
_Py_ANNOTATE_RWLOCK_DESTROY _Py_ANNOTATE_RWLOCK_RELEASED 
_Py_ANNOTATE_SWAP_MEMORY_RANGE _Py_ANNOTATE_THREAD_NAME 
_Py_ANNOTATE_TRACE_MEMORY
_Py_ANNOTATE_UNPROTECTED_READ _Py_ANNOTATE_UNPUBLISH_MEMORY_RANGE _Py_AS_GC 
_Py_CHECK_REFCNT _Py_COUNT_ALLOCS_COMMA _Py_DEC_REFTOTAL _Py_DEC_TPFREES
_Py_INC_REFTOTAL _Py_INC_TPALLOCS _Py_INC_TPFREES _Py_PARSE_PID 
_Py_REF_DEBUG_COMMA _Py_SET_EDOM_FOR_NAN

 BTW: I think the other issues mentioned in the discussion are more
 important to get right, than the names of those macros.

Yup.  Just paint it red. :)

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10542
___



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote on Mon, 15 Aug 2011 04:56:55 -: 

 Another thing I noticed is that (at least on wide builds) surrogate pairs are 
 not joined on the fly:
 >>> p
 '\ud800\udc00'
 >>> len(p)
 2
 >>> p.encode('utf-16').decode('utf-16')
 '𐀀'
 >>> len(_)
 1

(For those who may not immediately realize from reading the surrogates,
 '𐀀' is code point 0x10000, the first non-BMP code point.  I piped it 
 through `uniquote -x` just to make sure.)

Yes, that makes perfect sense.  It's something of a buggy feature or featureful 
bug
that UTF-16 does this.  

When you are thinking of arbitrary sequences of code points, which is
something you have to be able to do in memory but not in a UTF stream, then
one can say that one has four code points of anything in the 0 .. 0x10FFFF
range.  Those can be any arbitrary code points only (1) *while* in memory,
*and* assuming a (2) non-UTF16, ie UTF-32 or UTF-8 representation.  You
cannot do that with UTF-16, which is why it works only on a Python wide
build.  Otherwise they join up.

The reason they join up in UTF-16 is also the reason why unlike in regular
memory where you might be able to use an alternate representation like UTF-8 or
UTF-32, UTF streams cannot contain unpaired surrogates: because if that stream
were in UTF-16, you would never be able to tell the difference between a
sequence of a lead surrogate followed by a tail surrogate and the same thing
meaning just one non-BMP code point.  Since you would not be able to tell the
difference, it always only means the latter, and the former sense is illegal.
This is why lone surrogates are illegal in UTF streams.
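
For reference, the arithmetic behind that joining, as a small
self-contained sketch:

    # UTF-16 lead/trail surrogate pair <-> supplementary code point.
    def join_surrogates(lead, trail):
        assert 0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF
        return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

    def split_supplementary(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    assert join_surrogates(0xD800, 0xDC00) == 0x10000
    assert split_supplementary(0x1D49C) == (0xD835, 0xDC9C)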

In case it isn't obvious, *this* is the source of the [𝒜--𝒵] bug in all
the UTF-16 or UCS-2 regex languages. It is why Java 7 added \x{...}, so
that they can rewrite that as [\x{1D49C}--\x{1D4B5}] to pass the regex
compiler, so that it seems something indirect, not just surrogates.

That's why I always check it in my cross-language regex tests.  A 16-bit
language has to have a workaround, somehow, or it will be in trouble.

The Java regex compiler doesn't generate UTF-16 for itself, either. It
generates UTF-32 for its pattern.  You can see this right at the start of
the source code.  This is from the Java Pattern class:

/**
 * Copies regular expression to an int array and invokes the parsing
 * of the expression which will create the object tree.
 */
private void compile() {
// Handle canonical equivalences
if (has(CANON_EQ) && !has(LITERAL)) {
normalize();
} else {
normalizedPattern = pattern;
}
patternLength = normalizedPattern.length();

// Copy pattern to int array for convenience
// Use double zero to terminate pattern
temp = new int[patternLength + 2];

hasSupplementary = false;
int c, count = 0;
// Convert all chars into code points
for (int x = 0; x < patternLength; x += Character.charCount(c)) {
c = normalizedPattern.codePointAt(x);
if (isSupplementary(c)) {
hasSupplementary = true;
}
temp[count++] = c;
}

patternLength = count;   // patternLength now in code points

See how that works?  They use an int(-32) array, not a char(-16) array!  It's 
reasonably
clever, and necessary.  Because it does that, it can now compile \x{1D49C} or 
erstwhile
embedded UTF-8 non-BMP literals into UTF-32, and not get upset by the stormy 
sea of
troubles that surrogates are. You can't have surrogates in ranges if you don't 
do
something like this in a 16-bit language.

Java couldn't fix the [𝒜--𝒵] bug except by doing the \x{...} indirection trick,
because they are stuck with UTF-16.  However, they actually can match the string
𝒜 against the pattern ^.$, and have it fail on ^..$.   Yes, I know: the
code-unit length of that string is 2, but its regex count is just one dot worth.

I *believe* they did it that way because tr18 says it has to work that way, but
they may also have done it just because it makes sense.  My current contact at
Oracle doing regex support is not the guy who originally wrote the class, so I
am not sure.  (He's very good, BTW.  For Java 7, he also added named captures,
script properties, *and* brought the class up to conformance with tr18's 
level 1 requirements.)

I'm thinking Python might be able to do in the regex engine on narrow builds 
the 
sort of thing that Java does.  However, I am also thinking that that might
be a lot of work for a situation more readily addressed by phasing out narrow
builds or at least telling people they should use wide builds to get that thing
to work.  

--tom

==


===  QUASI OFF TOPIC ADDENDUM FOLLOWS

[issue12746] normalization is affected by unicode width

2011-08-15 Thread Tom Christiansen

Changes by Tom Christiansen tchr...@perl.com:


--
nosy: +tchrist

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12746
___



[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Unicode character names share a common namespace with formal aliases and with 
named sequences, but Python recognizes only the original name. That means not 
everything in the namespace is accessible from Python.  (If this is construed 
to be an extant bug rather than an absent feature, you probably want to change 
this from a wish to a bug in the ticket.)

This is a problem because aliases correct errors in the original names, and are 
the preferred versions.  For example, ISO screwed up when they called U+01A2 
LATIN CAPITAL LETTER OI.  It is actually LATIN CAPITAL LETTER GHA according to 
the file NameAliases.txt in the Unicode Character Database.  However, Python 
blows up when you try to use this:

% env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print(\N{LATIN CAPITAL 
LETTER OI})'
Ƣ

% env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print(\N{LATIN CAPITAL 
LETTER GHA})'
  File string, line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 0-27: unknown Unicode character name
Exit 1

This is unfortunate, because the formal aliases correct egregious blunders, such 
as the Standard reading BRAKCET instead of BRACKET:

$ uninames '^\s+%'
 Ƣ  01A2LATIN CAPITAL LETTER OI
% LATIN CAPITAL LETTER GHA
 ƣ  01A3LATIN SMALL LETTER OI
% LATIN SMALL LETTER GHA
* Pan-Turkic Latin alphabets
 ೞ  0CDEKANNADA LETTER FA
% KANNADA LETTER LLLA
* obsolete historic letter
* name is a mistake for LLLA
 ຝ  0E9DLAO LETTER FO TAM
% LAO LETTER FO FON
= fo fa
* name is a mistake for fo sung
 ຟ  0E9FLAO LETTER FO SUNG
% LAO LETTER FO FAY
* name is a mistake for fo tam
 ຣ  0EA3LAO LETTER LO LING
% LAO LETTER RO
= ro rot
* name is a mistake, lo ling is the mnemonic for 0EA5
 ລ  0EA5LAO LETTER LO LOOT
% LAO LETTER LO
= lo ling
* name is a mistake, lo loot is the mnemonic for 0EA3
 ࿐  0FD0TIBETAN MARK BSKA- SHOG GI MGO RGYAN
% TIBETAN MARK BKA- SHOG GI MGO RGYAN
* used in Bhutan
 ꀕ A015YI SYLLABLE WU
% YI SYLLABLE ITERATION MARK
* name is a misnomer
 ︘ FE18PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
% PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
* misspelling of BRACKET in character name is a known defect
# vertical 3017
 𝃅  1D0C5   BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
% BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
* misspelling of FTHORA in character name is a known defect

There are only 11 of these formal aliases at present.

In Perl, \N{...} grants access to the single, shared, common namespace of 
Unicode character names, formal aliases, and named sequences without 
distinction:

% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL 
LETTER OI}")'
Ƣ
% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL 
LETTER GHA}")'
Ƣ

% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL 
LETTER OI}")'  | uniquote -x
\x{1A2}
% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL 
LETTER GHA}")' | uniquote -x
\x{1A2}

It is my suggestion that Python do the same thing. There are currently only 11 
of these.  
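
A rough sketch of the requested fallback, assuming a small hand-built table of
the formal aliases (the table below shows only two entries; the rest come from
NameAliases.txt, and the helper name is hypothetical):

    import unicodedata

    FORMAL_ALIASES = {
        "LATIN CAPITAL LETTER GHA": 0x01A2,   # alias for LATIN CAPITAL LETTER OI
        "LATIN SMALL LETTER GHA":   0x01A3,   # alias for LATIN SMALL LETTER OI
        # ...the other formal aliases from NameAliases.txt...
    }

    def lookup(name):
        # Try the primary character name first, then fall back to an alias.
        try:
            return unicodedata.lookup(name)
        except KeyError:
            return chr(FORMAL_ALIASES[name])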

The third element in this shared namespace of names, the named sequences, comprises 
multiple code points masquerading under one name.  They come from the 
NamedSequences.txt file in the Unicode Character Database.  An example entry is:

LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300

There are 418 of these named sequences as of Unicode 6.0.0.  This shows that 
Perl can also access named sequences:

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL 
LETTER A WITH MACRON AND GRAVE}")'
  Ā̀

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL 
LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
  \x{100}\x{300}

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER 
AINU P}")'
  ㇷ゚

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER 
AINU P}")' | uniquote -x
   \x{31F7}\x{309A}


Since it is a single namespace, it makes sense that all members of that 
namespace should be accessible using \N{...} as a sort of equal-opportunity 
accessor mechanism, and it does not make sense that they not be.

Just make sure you take only the approved named sequences from the 
NamedSequences.txt file. It would be unwise to give users access to the 
provisional sequences located in a neighboring file I shall not name :) because 
those are not guaranteed never to be withdrawn the way the others are, and so 
you would risk introducing an incompatibility.
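
For concreteness, a sketch of reading just that approved file; the format is
NAME;codepoint codepoint ..., and the file path here is an assumption:

    def load_named_sequences(path="NamedSequences.txt"):
        # Map each approved sequence name to the string of code points it names.
        seqs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                name, codepoints = line.split(";")
                seqs[name] = "".join(chr(int(cp, 16)) for cp in codepoints.split())
        return seqs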

If you look at the ICU UCharacter class, you can see that they provide a more

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy tjre...@udel.edu added the comment:

 My Firefox is already set at utf-8. More likely a font limitation. I
 will look again after installing one of the fonts Tom suggested.

Symbola is best for exotic glyphs, especially astral ones.

Alfios just looks nice as a normal default roman.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12730
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy tjre...@udel.edu added the comment:

 You are right, FF switched on me without notice. Bad FF. Thank you! What
 I now see makes much more sense.

[ мЯхШщЯл, мЯхШщЯл, ДЯхШщЯл, ДЇНЀСЇГ  ],

 and I now know to check on other pages (although Tom's Unicode talk
 slides still have boxes even in utf-8, so that must be a font lack).

Do you have Symbola installed?  Here's Appendix I on Fonts for things that
should look right for the presentation to look right.  

* I recommend two free fonts from George Douros at users.teilar.gr/~g1951d/ 
known to
  work with this presentation: his Alfios font for regular text, and his 
Symbola font
  for fancy emoji. If any of these don’t look right to you, you probably 
need to
  supplement your system fonts:

Ligatures: fi ffi ff ffl fl β ẞ ſt st
Math letters: 풜 풟 픅 픎 피 픽
Gothic & Deseret: ̸̼̽͂, ДЯхШщЯл
Symbols: ✔ ✅    
Emoticons:      
Upside‐down: ¡pɐəɥ ɹnoʎ uo ƃuᴉpuɐʇs ʎq sᴉɥʇ pɐəᴚ
Combining characters: ◌̂,◌̃,◌⃞,◌̲,◌︀,◌̵,◌̷

* The last line with combining characters is especially hard to get to look 
right. 
  You may find that the shareware font Everson Mono works when all else 
fails.

You do need Unicode 5.1 support for the LATIN CAPITAL LETTER SHARP S, and
you need Unicode 6.0 support for most of the emoji (I think Snow Leopard
has colorized versions of these).  The Ligature line above looks good in Alfios.

It turns out it may not always be the font used with combining chars so much as 
whether and
how well your browser supports true combining characters generated dynamically, or 
whether it
runs stuff through NFC and looks for substitution glyphs.  I am not a GUI 
person, so am
mostly just guessing.

But this I find interesting:  If you look at slide 33 of my first talk or slide 
5 of my
second talk, which are duplicates entitled "Canonical Conundra", the second 
column, which is
labelled "Glyphs", explicitly uses Times New Roman because of this issue.  Even so 
you can
tell it is doing the NFC trick, because lines 1+2 have the same NFC of \x{F5} 
or õ, as do
3+4+5 with \x{22D} or ȭ, and 6+7 with ō̃.

The glyphs from the first group are both identical, and so are all three of those 
in the
second group, as both the first two groups have a single precomposed character 
available
for their NFC.  In contrast, there is no single precomposed glyph available for 
6+7, and
you can tell that it's stacking it on the fly using slightly less tight 
grouping rules
than the font has in the precomposed versions above it.
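
The same composition behavior is easy to check from Python with
unicodedata.normalize(); a small sketch (Python 3 syntax, expected results in
the comments):

    import unicodedata

    # o + combining tilde composes to the single precomposed U+00F5.
    print(len(unicodedata.normalize("NFC", "o\u0303")))        # 1, i.e. õ
    # o + tilde + macron composes to the single precomposed U+022D.
    print(len(unicodedata.normalize("NFC", "o\u0303\u0304")))  # 1, i.e. ȭ
    # o + macron + tilde has no precomposed form, so NFC leaves a base
    # character plus a combining mark: ō followed by U+0303.
    print(len(unicodedata.normalize("NFC", "o\u0304\u0303")))  # 2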

I use Safari, but I am told Firefox looks ok, too.  Opera is my normal browser 
but it
does the copout I just described on combining chars without ever being able to
dynamically stack them if the copout fails, so I can't use it for this 
presentation.

--tom

  $ uniprops -a 'LATIN CAPITAL LETTER SHARP S' 'DESERET CAPITAL LETTER DEE' 
'GOTHIC LETTER MANNA' 'SNAKE' 'FACE SCREAMING IN FEAR'

U+1E9E ẞ \N{LATIN CAPITAL LETTER SHARP S}
\w \pL \p{LC} \p{L_} \p{L} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InLatinExtendedAdditional Cased 
Cased_Letter LC Changes_When_Casefolded CWCF
   Changes_When_Casemapped CWCM Changes_When_Lowercased CWL 
Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base
   Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn 
Latin_Extended_Additional Uppercase_Letter Print Upper
   Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum 
X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper
   X_POSIX_Word
Age=5.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L 
Block=Latin_Extended_Additional Canonical_Combining_Class=0
   Canonical_Combining_Class=Not_Reordered CCC=NR 
Canonical_Combining_Class=NR Decomposition_Type=None DT=None
   East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
   Hangul_Syllable_Type=Not_Applicable HST=NA 
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining
   JT=U Joining_Type=U Script=Latin Line_Break=AL Line_Break=Alphabetic 
LB=AL Numeric_Type=None NT=None Numeric_Value=NaN
   NV=NaN Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 
IN=6.0 SC=Latn Script=Latn Sentence_Break=UP
   Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE 
_X_Begin

U+10414 𐐔 \N{DESERET CAPITAL LETTER DEE}
\w \pL \p{LC} \p{L_} \p{L} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC 
Changes_When_Casefolded CWCF
   Changes_When_Casemapped CWCM Changes_When_Lowercased CWL 
Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base
   Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ 
Uppercase_Letter Print Upper Uppercase Word
   XID_Continue XIDC

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Sorry I didn't include a test case. Hope this makes up for it.  If not, please 
tell me how to write better test cases. :(

Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical 
alignment, but it really helps to make what is different from one to line to 
the next stand out if the parts that are the same from line to line are at the 
same column every time.

--
Added file: http://bugs.python.org/file22902/nametests.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Oh whoops, that was the wrong ticket.  Shall I reupload to the right number?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy tjre...@udel.edu added the comment:

Adding Symbola filled in the symbols and emoticons lines.
The gothic chars are still missing even with Alfios.

That's too bad, as the Gothic paternoster is kinda cute. :)

Hm, I wonder where I got them from.  I think there must 
be a way to figure that out using the Mac FontBook program,
but I don't know what it is other than pasting them in
the sample screen and scrolling through the fonts to see
how those get rendered.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12730
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Here’s the right test file for the right ticket.

--
Added file: http://bugs.python.org/file22903/nametests.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12753
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Changes by Tom Christiansen tchr...@perl.com:


Removed file: http://bugs.python.org/file22902/nametests.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

 It is simply a design error to pretend that the number of characters
 is the number of code units instead of code points.  A terrible and
 ugly one, but it does not mean you are UCS-2.

 If you are referring to the value returned by len(unicode_string), it
 is the number of code units.  This is a matter of practicality beats
 purity.  Returning the number of code units is O(1) (num_of_bytes/2).
 To calculate the number of characters it's instead necessary to scan
 all the string looking for surrogates and then count any surrogate
 pair as 1 character.  It was therefore decided that it was not worth
 to slow down the common case just to be 100% accurate in the
 uncommon case.

If speed is more important than correctness, I can make any algorithm
infinitely fast.  Given the choice between correct and quick, I will 
take correct every single time.

Plus your strings are immutable! You know how long they are and they 
never change.  Correctness comes at a negligible cost.  

It was a bad choice to return the wrong answer.

 That said it would be nice to have an API (maybe in unicodedata or as
 new str methods?) able to return the number of code units, code
 points, graphemes, etc, but I'm not sure that it should be the default
 behavior of len().

Always code points, never code units.  I even use a class whose length
method returns the grapheme count, because even code points aren't good
enough.  Yes of course graphemes have to be counted.  Big deal.   How 
would you like it if you said to move three to the left in vim and 
it *didn't* count each grapheme as one position?  Madness.
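
For what it is worth, the narrow-build code-point count is a one-liner once you
accept the scan; a sketch that assumes the string is well-formed UTF-16 with no
lone surrogates:

    def codepoint_len(s):
        # len() counts code units on a narrow build; every high surrogate that
        # begins a pair overcounts by exactly one, so subtract them back out.
        return len(s) - sum(1 for ch in s if 0xD800 <= ord(ch) <= 0xDBFF)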

 The ugly terrible design error is digusting and wrong, just as much
 in Python as in Java, and perhaps moreso because of the idiocy of
 narrow builds even existing.

 Again, wide builds use twice as much the space than narrow ones, but
 one the other hand you can have fast and correct behavior with e.g.
 len().  If people don't care about/don't need to use non-BMP chars and
 would rather use less space, they can do so.  Until we agree that the
 difference in space used/speed is no longer relevant and/or that non-
 BMP characters become common enough to prefer the correct behavior
 over the fast-but-inaccurate one, we will probably keep both.

Which is why I always put loud warnings in my Unicode-related Python
programs that they do not work right on Unicode if running under
a narrow build.  I almost feel I should just exit.

 I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
 broken in a bunch of ways.  You should be raising an exception in
 all kinds of places and you aren't.

 I am aware of some problems of the UTF-8 codec on Python 2.  It used
 to follow RFC 2279 until last year and now it's been updated to follow
 RFC 3629.

Unicode says you can't put surrogates or noncharacters in a UTF-anything 
stream.  It's a bug to do so and pretend it's a UTF-whatever.

Perl has an encoding form, which it does not call UTF-8, that you 
can use the UTF-8 algorithm on for any code point, including non-characters
and surrogates and even non-Unicode code points far above 0x10_FFFF, up
to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines.  It's the internal
format we use in memory.  But we don't call it real UTF-8, either.

It sounds like this is the kind of thing that would be useful to you.

 However, for backward compatibility, it still encodes/decodes
 surrogate pairs.  This broken behavior has been kept because on Python
 2, you can encode every code point with UTF-8, and decode it back
 without errors:

No, that's not UTF-8 then.  By definition.  See the Unicode Standard.

 x = [unichr(c).encode('utf-8') for c in range(0x110000)]


 and breaking this invariant would probably make more harm than good.

Why?  Create something called utf8-extended or utf8-lax or utf8-nonstrict
or something.  But you really can't call it UTF-8 and do that.  

We actually equate UTF-8 and utf8-strict.  Our internal extended
UTF-8 is something else.  It seems like you're still doing the old
relaxed version we used to have until 2003 or so.  It seems useful
to be able to have both flavors, the strict and the relaxed one,
and to call them different things.  

Perl defaults to the relaxed one, which gives warnings not exceptions,
if you do things like setting PERL_UNICODE to S or SD and such for the
default I/O encoding.  If you actually use UTF-8 as the encoding on the 
stream, though, you
get the version that gives exceptions instead.  

    UTF-8 = utf8-strict   strictly by the standard, raises exceptions otherwise
    utf8                  loosely only, emits warnings on encoding illegal things
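
Python 3 can express roughly that strict/relaxed split through its error
handlers rather than through two codec names; a small sketch using the real
'surrogatepass' handler (whether this maps exactly onto the Perl split is a
reading, not a given):

    # Strict flavor: the default utf-8 codec refuses a lone surrogate.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError:
        print("strict utf-8 refused the surrogate")

    # Relaxed flavor: 'surrogatepass' writes the three-byte form out anyway.
    print("\ud800".encode("utf-8", "surrogatepass"))    # b'\xed\xa0\x80'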

We currently only emit warnings or raise exceptions on I/O, not on chr
operations and such.  We used to raise exceptions on things like
chr(0xD800), but that was a mistake caused by misunderstanding the in-
memory requirements being

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

On neither narrow nor wide builds does this UTF8-encoded bit run without 
raising an exception: 

    if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE): 
        print("match 1 passed")
    else:
        print("match 2 failed")

The best you can possibly do is to use both a wide build *and* symbolic 
literals, in which case it will pass. But remove either of both of those 
conditions and you fail.  This is too restrictive for full Unicode use. 

There should never be any situation where [a-z] fails to match c when a < c < z, 
and neither a nor z is something special in a character class.  There is, or 
perhaps should be, no difference at all between [a-z] and [𝒜-𝒵], just as 
there is, or at least should be, no difference between c and 𝒞. You can’t 
have second-class citizens like this that can't be used.
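
The only narrow-build workaround that seems available today is to spell the
class out in surrogate pairs by hand, which is exactly the second-class
treatment being complained about; a sketch for U+1D49C..U+1D4B5 (the pair
arithmetic is worked out here, not taken from any library, and the trick only
makes sense on a narrow build):

    import re

    # U+1D49C..U+1D4B5 all share the lead surrogate D835, so the range
    # collapses to one lead surrogate plus a trail-surrogate class.
    script_caps = re.compile("\ud835[\udc9c-\udcb5]")
    # True on a narrow 3.2 build, where the literal below is two code units;
    # a wide build would need the real non-BMP range instead.
    print(bool(script_caps.search("\U0001D49E")))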

And no, this one is *not* fixed by Matthew Barnett's regex library. There is 
some dumb UCS-2 assumption lurking deep in Python somewhere that makes this 
break, even on wide builds, which is incomprehensible to me.

--
components: Regular Expressions
files: bigrange.py
messages: 142058
nosy: Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, 
tchrist, terry.reedy
priority: normal
severity: normal
status: open
title: lib re cannot match non-BMP ranges (all versions, all builds)
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file22897/bigrange.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12749
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Sun, 14 Aug 2011 07:15:09 -:

 Unicode says you can't put surrogates or noncharacters in a
 UTF-anything stream.  It's a bug to do so and pretend it's a
 UTF-whatever.

 The UTF-8 codec described by RFC 2279 didn't say so, so, since our
 codec was following RFC 2279, it was producing valid UTF-8.  With RFC
 3629 a number of things changed in a non-backward compatible way.
 Therefore we couldn't just change the behavior of the UTF-8 codec nor
 rename it to something else in Python 2.  We had to wait till Python 3
 in order to fix it.

I'm a bit confused on this.  You no longer fix bugs in Python 2?

I've dug out the references that state that you are not allowed to do things the
way you are doing them.  This is from the published Unicode Standard version 
6.0.0,
chapter 3, Conformance.  It is a very important chapter.

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Python is in violation of that published Standard by interpreting noncharacter 
code
points as abstract characters and tolerating them in character encoding forms 
like
UTF-8 or UTF-16.  This explains that conformant processes are forbidden from 
doing this.

Code Points Unassigned to Abstract Characters

 C1 A process shall not interpret a high-surrogate code point or a 
low-surrogate code point
 as an abstract character.
   · The high-surrogate and low-surrogate code points are designated for 
surrogate
 code units in the UTF-16 character encoding form. They are unassigned 
to any
 abstract character.

==  C2 A process shall not interpret a noncharacter code point as an abstract 
character.
   · The noncharacter code points may be used internally, such as for 
sentinel val-
 ues or delimiters, but should not be exchanged publicly.

 C3 A process shall not interpret an unassigned code point as an abstract 
character.
   · This clause does not preclude the assignment of certain generic 
semantics to
 unassigned code points (for example, rendering with a glyph to 
indicate the
 position within a character block) that allow for graceful behavior in 
the pres-
 ence of code points that are outside a supported subset.
   · Unassigned code points may have default property values. (See D26.)
   · Code points whose use has not yet been designated may be assigned to 
abstract
 characters in future versions of the standard. Because of this fact, 
due care in
 the handling of generic semantics for such code points is likely to 
provide better
 robustness for implementations that may encounter data based on future 
ver-
 sions of the standard.

Next we have exactly how something you call UTF-{8,16,32} must be formed.
*This* is the Standard against which these things are measured; it is not the 
RFC.

You are of course perfectly free to say you conform to this and that RFC, but 
you
must not say you conform to the Unicode Standard when you don't.  These are 
different
things.  I feel it does users a grave disservice to ignore the Unicode Standard 
in
this, and sheer casuistry to rely on an RFC definition while ignoring the 
Unicode
Standard whence it originated, because this borders on being intentionally 
misleading.

Character Encoding Forms

 C8 When a process interprets a code unit sequence which purports to be in 
a Unicode char-
 acter encoding form, it shall interpret that code unit sequence 
according to the corre-
 sponding code point sequence.
==· The specification of the code unit sequences for UTF-8 is given in D92.
   · The specification of the code unit sequences for UTF-16 is given in 
D91.
   · The specification of the code unit sequences for UTF-32 is given in 
D90.

 C9 When a process generates a code unit sequence which purports to be in a 
Unicode char-
 acter encoding form, it shall not emit ill-formed code unit sequences.
   · The definition of each Unicode character encoding form specifies the 
ill-
 formed code unit sequences in the character encoding form. For 
example, the
 definition of UTF-8 (D92) specifies that code unit sequences such as 
C0 AF
 are ill-formed.

== C10 When a process interprets a code unit sequence which purports to be in 
a Unicode char-
 acter encoding form, it shall treat ill-formed code unit sequences as 
an error condition
 and shall not interpret such sequences as characters.
   · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
 by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
 is ill-formed and must never be generated. When faced with this ill-formed
 code unit sequence while transforming or interpreting text, a conformant
 process must treat the first code unit 110xxxxx₂

[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

On wide 3.2 it passes too, so the failure is limited to narrow builds (are
you sure that it fails on wide builds for you?).

You're right: my wide build is not Python3, just Python2.  In fact,
it's even worse, because it's the stock build on Linux, which seems
on this machine to be 2.6 not 2.7.

I have private builds that are 2.7 and 3.2, but those are both narrow.
I do not have a 3.3 build.  Should I?

I'm remembering why I removed Python2 from my Unicode talk, because
of how it made me pull my hair out.  People at the talk wanted to know
what I meant, but I didn't have time to go into it.  I think this
gets added to the hairpulling list.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12749
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Sun, 14 Aug 2011 07:15:09 -: 

 For example I don't think removing the 0x10 upper limit is going to
 happen -- even if it might be useful for other things. 

I agree entirely.  That's why I appended a triple exclamation point to where I
said I certainly do not expect this.  It can only work fully on UTF-8ish systems
and up to 32 bits on UTF-32, and it is most emphatically *not* Unicode.  Yes,
there are things you can do with it, but it risks serious misunderstanding and
even noncomformance if not done very carefully.  The Standard does not forbid
such things internally, but you are not allowed to pass them around in
noninternal streams claiming they are real UTF streams.

 Also regular expressions are not part of the core and are not used
 that often, so I consider problems with narrow/wide builds, codecs and
 the unicode type much more important than problems with the re/regex
 module (they should be fixed too, but have lower priority IMHO).

One advantage of having an external library is the ability to update
it asynchronously.  Another is the possibility to swap in out altogether.
Perl only gained that ability, which Python has always had, some four
years ago with its 5.10 release.  To my knowledge, the only thing people
tend to use this for is to get Russ Cox's re2 library, which has very
different performance characteristics and guarantees that allow it to 
be used in potential starvation denial-of-service situations that the
normal Perl, Python, Java, etc regex engine cannot be safely used for.

-tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:15:52 -: 

 You're right: my wide build is not Python3, just Python2.

 And is it failing?  Here the tests pass on the wide builds, on both Python 2 
 and 3.

Perhaps I am doing something wrong?

linux% python --version
Python 2.6.2

linux% python -c 'import sys; print sys.maxunicode'
1114111

linux% cat -n bigrange.py
 1  #!/usr/bin/env python
 2  # -*- coding: UTF-8 -*-
 3  
 4  from __future__ import print_function
 5  from __future__ import unicode_literals
 6  
 7  import re
 8  
 9  flags = re.UNICODE
10  
 11  if re.search("[a-z]", "c", flags): 
 12      print("match 1 passed")
 13  else:
 14      print("match 1 failed")
 15  
 16  if re.search("[𝒜-𝒵]", "𝒞", flags): 
 17      print("match 2 passed")
 18  else:
 19      print("match 2 failed")
 20  
 21  if re.search("[\U0001D49C-\U0001D4B5]", "\U0001D49E", flags): 
 22      print("match 3 passed")
 23  else:
 24      print("match 3 failed")
 25  
 26  if re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
 27                "\N{MATHEMATICAL SCRIPT CAPITAL C}", flags): 
 28      print("match 4 passed")
 29  else:
 30      print("match 4 failed")

linux% python bigrange.py
match 1 passed
Traceback (most recent call last):
  File "bigrange.py", line 16, in <module>
    if re.search("[𝒜-𝒵]", "𝒞", flags): 
  File "/usr/lib64/python2.6/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib64/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

 In fact, it's even worse, because it's the stock build on Linux, 
 which seems on this machine to be 2.6 not 2.7.

 What is worse?  FWIW on my system the default `python` is a 2.7 wide. 
 `python3` is a 3.2 wide.

I meant that it was running 2.6 not 2.7.  

 I have private builds that are 2.7 and 3.2, but those are both narrow.
 I do not have a 3.3 build.  Should I?

 3.3 is the version in development, not released yet.  If you have an
 HG clone of Python you can make a wide build of 3.x with ./configure
 --with-wide-unicode and of 2.7 using ./configure --enable-
 unicode=ucs4.

And Antoine Pitrou pit...@free.fr wrote:

 I have private builds that are 2.7 and 3.2, but those are both narrow.
 I do not have a 3.3 build.  Should I?

 I don't know if you *should*. But you can make one easily by passing
 --with-wide-unicode to ./configure.

Oh good.  I need to read configure --help more carefully next time.
I have to do some Lucene work this afternoon, so I can let several builds
chug along.  

Is there a way to easily have these co-exist on the same system?  I'm sure
I have to rebuild all C extensions for the new builds, but I wonder what to do
about (for example) /usr/local/lib/python3.2 being able to be only one of
narrow or wide.  Probably I just need to go read the configure stuff better
for alternate paths.  Unsure.  

Variant Perl builds can coexist on the same system with some directories
shared and others not, but I often find other systems aren't quite that
flexible, usually requiring their own dedicated trees.  Manpaths can get
tricky, too.

 I'm remembering why I removed Python2 from my Unicode talk, because
 of how it made me pull my hair out.  People at the talk wanted to know
 what I meant, but I didn't have time to go into it.  I think this
 gets added to the hairpulling list.

 I'm not sure what you are referring to here.

There seem to be many more things to get wrong with Unicode in v2 than in v3.

I don't know how much of this is just my slowness at ramping up the learning
curve, how much is due to historical defaults that don't work well for 
Unicode, and how much is 

Python2:

    re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT 
CAPITAL Z}]".encode('utf-8'), 
       u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE)

Python3:

    re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT 
CAPITAL Z}]",
       "\N{MATHEMATICAL SCRIPT CAPITAL C}", re.UNICODE)

The Python2 version is *much* noisier.  

(1) You have to keep remembering to u"..." everything because neither
# -*- coding: UTF-8 -*-
nor even
from __future__ import unicode_literals
suffices.  

(2) You have to manually encode every string, which is utterly bizarre to me.

(3) Plus you then have to turn around and tell re, "Hey, by the way, you know those
Unicode strings I just passed you?  Those are Unicode strings, you know."
Like it couldn't tell that already by realizing it got Unicode not byte 
strings.  So weird.

It's a very awkward model.  Compare Perl's

   "\N{MATHEMATICAL SCRIPT CAPITAL C}" =~ /[\N{MATHEMATICAL SCRIPT CAPITAL 
A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]/

That's the kind of thing I'm used

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Sun, 14 Aug 2011 17:46:55 -: 

 I'm a bit confused on this.  You no longer fix bugs in Python 2?

 We do, but it's unlikely that we will introduce major changes in behavior.

 Even if we had to get rid of narrow builds and/or fix len(), we would
 probably only do it in the next 3.x version (i.e. 3.3), and not in the
 next bug fix release of 3.2 (i.e. 3.2.2).

Antoine Pitrou rep...@bugs.python.org wrote
   on Sun, 14 Aug 2011 17:36:42 -:

 This is even truer for stable branches, and Python 2 is very much a
 stable branch now (no more feature releases after 2.7).

Does that mean you now go to 2.7.1, 2.7.2, etc?

I had thought that 2.6 was going to be the last, but then 2.7
came out.  I think I remember Guido said something about there 
never being a 2.10, so I wasn't too surprised to see 2.7.  

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy rep...@bugs.python.org wrote
   on Mon, 15 Aug 2011 00:26:53 -: 

 PS: The OSCON link in msg142036 currently gives me 404 not found

Sorry, I wrote 

 http://training.perl.com/OSCON/index.html

but meant 

 http://training.perl.com/OSCON2011/index.html

I'll fix it on the server in a short spell.

I am trying to keep the document up to date as I learn more, so it
isn't precisely the talk I gave in Portland.

 Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

So I'm finding.  Perhaps that's why I keep getting confused. I do have a pretty 
firm
notion of what UCS-2 and UTF-16 are, and so I get sometimes self-contradictory 
results.
Can you think of anywhere that Python acts like UCS-2 and not UTF-16?  I'm not 
sure I
have found one, although the regex thing might count.

Thank you guys for being so helpful and understanding.

 They support non-BMP chars but only partially, because, BY DESIGN*,
 indexing and len are by code units, not codepoints. 

That's what Java did, too, and for the same reason.  Because they had
a UCS-2 implementation for Unicode 1.1 so when Unicode 2.0 came out
and they learned that they would need more than 16 bits, they piggybacked
UTF-16 onto the top of it instead of going for UTF-8 or UTF-32, and they're
still paying that price, and to my mind, heavily and continually.

Do you use Java?  It is very like Python in many of its 16-bit character issues.
Most of the length and indexing type functions address things by code unit
only, not codepoint.  But they would never claim to be UCS-2.

Oh, I realize why they did it.  For one thing, they had bytecode out there
that they had to support.  For another, they had some pretty low-level APIs
that didn't have enough flexibility of abstraction, so old source had to keep
working as before, even though this penalized the future.  Forever, kinda.

While I wish they had done better, and kinda think they could have, it
isn't my place to say.  I wasn't there (well, not paying attention) when
this was all happening, because I was so underwhelmed by the how annoyingly
overhyped it was.  A billion dollars of marketing can't be wrong, you know?
I know that smart people looked at it, seriously.  I just find the cure
they devised to be more in the problem set than the solution set.

I like how Python works on wide builds, especially with Python3. I was
pretty surprised that the symbolic names weren't working right on the
earlier version of the 2.6 wide build I tried them on.

I now have both wide and narrow builds installed of both 2.7 and 3.2,
so that shouldn't happen again.

 They are documented as being UCS-2 because that is what M-A Lemburg,
 the original designer and writer of Python's unicode type and the unicode-
 capable re module, wants them to be called. The link to msg142037,
 which is one of 50+ in the thread (and many or most other disagree),
 pretty well explains his viewpoint.

Count me as one of those many/most others who disagree. :)

 The positive side is that we deliver more than we promise. The
 negative side is that by not promising what perhaps we should allows
 us not to deliver what perhaps we should.

It is always better to deliver more than you say than to deliver less.

 * While I think this design decision may have been OK a decade ago for
   a first implementation of an *optional* text type, I do not think it
   so for the future for revised implementations of what is now *the*
   text type. I think narrow builds can and should be revised and
   upgraded to index, slice, and measure by codepoints. 

Yes, I think so, too.  If you look at the growth curve of UTF-8 alone,
it has followed a mathematically exponential growth curve in the 
first decade of this century.  I suspect that will turn into an S
curve with asymptotic shoulders any time now.  I haven't looked
at it lately, so maybe it already has.  I know that huge corpora I work
with at work are all absolutely 100% Unicode now.  Thank XML for that.

 Here is my current idea:

 If the code unit stream contains any non-BMP characters (ie, surrogate
 pair of 16-bit code units), construct a sequence of *indexes* of such
 characters (pairs). The fixed length of the string in codepoints is
 n-k, where n is the number of code units (the current length) and k is
 the length of the auxiliary sequence and the number of pairs. For
 indexing, look up the character index in the list of indexes by binary
 search and increment the codepoint index by the index of the index
 found to get the corresponding code unit index. (I have omitted the
 details needed to avoid off-by-1 errors.)

 This would make indexing O(log(k)) when there are surrogates. If that
 is really a problem because k is a substantial fraction of a 'large'
 n, then one should use a wide build. By using a separate internal
 class, there would be no time or space penalty for all-BMP text. I
 will work on a prototype
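
A rough sketch of that auxiliary-index idea, for concreteness; every name here
is hypothetical and the off-by-one handling is simplified exactly as the
description is:

    import bisect

    def build_pair_index(s):
        # Code-point indexes of the characters stored as surrogate pairs.
        pairs, cp, i = [], 0, 0
        while i < len(s):
            if (0xD800 <= ord(s[i]) <= 0xDBFF and i + 1 < len(s)
                    and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
                pairs.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        return pairs

    def unit_index(pairs, cp_index):
        # Each pair below cp_index widens the string by one extra code unit;
        # count them with a binary search, O(log k).
        return cp_index + bisect.bisect_left(pairs, cp_index)

    # The code-point length is then just len(s) - len(pairs).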

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-14 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I wrote:

 Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

 So I'm finding.  Perhaps that's why I keep getting confused. I do have a 
 pretty firm
 notion of what UCS-2 and UTF-16 are, and so I get sometimes 
 self-contradictory results.
 Can you think of anywhere that Python acts like UCS-2 and not UTF-16?  I'm 
 not sure I
 have found one, although the regex thing might count.

I just thought of one.  The casemapping functions don't work right on
Deseret, which is a non-BMP case-changing script.  That's one I submitted
as a bug, because I figure if the UTF-8 decoder can decode the non-BMP
code points into paired UTF-16 surrogates, then the casing functions had
jolly well better be able to deal with it.  If the UTF-8 decoder knows it is only
going to UCS-2, then it should have raised an exception on my non-BMP source.
Since it went to UTF-16, the rest of the language should have behaved 
accordingly.
Java does this right, BTW, despite its UTF-16ness.
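
The check itself is tiny; a sketch of what a wide build (or Java) gets right,
with the expected result in the comment (DESERET CAPITAL LETTER DEE is U+10414,
its lowercase U+1043C):

    dee = "\U00010414"                  # DESERET CAPITAL LETTER DEE
    print(dee.lower() == "\U0001043C")  # True on a wide build; a narrow build
                                        # leaves the surrogates untouched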

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

David Murray rep...@bugs.python.org wrote:

 Tom, note that nobody is arguing that what you are requesting is a bad
 thing :)

There looked to be some minor resistance, based on absolute backwards
compatibility even if wrong, regarding changing anything *at all* in re,
even things that to my jaded eye seem like actual bugs.

There are bugs, and then there are bugs.

In my survey of Unicode support across 7 programming languages for OSCON

http://training.perl.com/OSCON/index.html

I came across a lot of weirdnesses, especially at first when the learning
curve was high.  Sure, I found it odd that unlike Java, Perl, and Ruby,
Python didn't offer regular casemapping on strings, only the simple
character-based mapping.  But that doesn't make it a bug, which is why
I filed it as an feature/enhancement request/wish, not as a bug.

I always count as bugs not handling Unicode text the way Unicode says
it must be handled.  Such things would be:

Emitting CESU-8 when told to emit UTF-8.

Violating the rule that UTF-8 must be in the shortest possible encoding.

Not treating a code point as a letter when the supported version of the
UCD says it is.  (This can happen if internal rules get out of sync
with the current UCD.)

Claiming one does the expected thing on Unicode for case-insensitive
matches when not doing what Unicode says you must minimally do: use at
least the simple casefolds, if not in fact the full ones.

Saying \w matches Unicode word characters when one's definition of
word characters differs from that of the supported version of the UCD.
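
One of those is easy to probe from the interpreter; a quick sketch of the
shortest-form rule, where C0 AF is the classic overlong encoding of '/'
(CPython appears to reject this particular case already):

    # A conformant UTF-8 decoder must refuse the overlong two-byte form of U+002F.
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError:
        print("overlong sequence correctly refused")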

Supporting Unicode vX.Y.Z is more than adding more characters.  All the
behaviors specified in the UCD have to be updated too, or else you are just
ISO 10646.  I believe some of Python's Unicode bugs happened because folks
weren't aware which things in Python were defined by the UCD or by various
UTS reports yet were not being directly tracked that way.  That's why it's
important to always fully state which version of these things you follow.

Other bugs, many actually, are a result of the narrow/wide-build untransparency.

There is wiggle room in some of these.  For example, in the one that
applies to re, you could -- in a sense -- remove the bug by no longer
claiming to do case-insensitive matches on Unicode.  I do not find that very
useful. Javascript works this way: it doesn't do Unicode casefolding.  Java you
have to ask nicely with the extra UNICODE_CASE flag, aka (?u), used with the
CASE_INSENSITIVE, aka (?i).

Sometimes languages provide different but equivalent interfaces to the same
functionality.  For example, you may not support the Unicode property
\p{NAME=foobar} in patterns but instead support \N{foobar} in patterns and
hopefully also in strings.  That's just fine.  On slightly shakier ground but
still I think defensible is how one approaches support for the standard UCD
properties:

  Case_FoldingSimple_Case_Folding
 Titlecase_MappingSimple_Titlecase_Mapping
 Uppercase_MappingSimple_Uppercase_Mapping
 Lowercase_MappingSimple_Lowercase_Mapping

One can support folding, for example, via (?i) and not have to
directly supporting a Case_Folding property like \p{Case_Folding=s},
since (?i)s should be the same thing as \p{Case_Folding=s}.

 As far as I know, Matthew is the only one currently working on the
 regex support in Python.  (Other developers will commit small fixes if
 someone proposes a patch, but no one that I've seen other than Matthew
 is working on the deeper issues.)  If you want to help out that would
 be great.

Yes, I actually would.  At least as I find time for it.  I'm a competent C
programmer and Matthew's C code is very well documented, but that's very
time consuming.  For bang-for-buck, I do best on test and doc work, making
sure things are actually working the way they say they do.

I was pretty surprised and disappointed by how much trouble I had with
Unicode work in Python.  A bit of that is learning curve, a bit of it is
suboptimal defaults, but quite a bit of it is that things either don't work
the way Unicode says, or because something is altogether missing.  I'd like
to help at least make the Python documentation clearer about what it is
or is not doing in this regard.

But be warned: one reason that Java 1.7 handles Unicode more according to
the published Unicode Standard in its Character, String, and Pattern
classes is because when they said they'd be supporting Unicode 6.0.0,
I went through those classes and every time I found something in violation
of that Standard, I filed a bug report that included a documentation patch
explaining what they weren't doing right.  Rather than apply my rather
embarrassing doc patches, they instead fixed the code. :)

 And as far as this particular issue goes, yes the difference between
 the narrow and wide build has been a known issue for a long time

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Matthew Barnett rep...@bugs.python.org wrote
   on Sat, 13 Aug 2011 20:57:40 -: 

 There are occasions when you want to do string slicing, often of the form:

   pos = my_str.index(x)
   endpos = my_str.index(y)
   substring = my_str[pos : endpos]

Me, I would probably give the second call to index the first  
index position to guarantee the end comes after the start:

    str  = "for finding the biggest of all the strings"
    x_at = str.index("big")
    y_at = str.index("the", x_at)
    some = str[x_at:y_at]
    print("GOT", some)

But here's a serious question: is that *actually* a common usage pattern
for accessing strings in Python?  I ask because it wouldn't even *occur* to
me to go at such a problem in that way.  I would have always just written
it this way instead:

import re
    str  = "for finding the biggest of all the strings"
    some = re.search("(big.*?)the", str).group(1)
    print("GOT", some)

I know I would use the pattern approach, just because that's 
how I always do such things in Perl:

    $str  = "for finding the biggest of all the strings";
    ($some) = $str =~ /(big.*?)the/;
    print "GOT $some\n";

Which is obviously a *whole* lot simpler than the index approach:

    $str  = "for finding the biggest of all the strings";
    $x_at = index($str, "big");
    $y_at = index($str, "the", $x_at);
    $len  = $y_at - $x_at;
    $some = substr($str, $x_at, $len);
    print "GOT $some\n";

With no arithmetic and no need for temporary variables (you can't really
escape needing x_at to pass to the second call to index), it's all a
lot more WYSIWYG.  See how much easier that is?  

Sure, it's a bit cleaner and less noisy in Perl than it is in Python by
virtue of Perl's integrated pattern matching, but I would still use
patterns in Python for this, not index.  

I honestly find the equivalent pattern operations a lot easier to read and write
and maintain than I find the index/substring version.  It's a visual thing.  
I find patterns a win in maintainability over all that busy index monkeywork.  
The index/rindex and substring approach is one I almost never ever turn to.
I bet I use pattern matching 100 or 500 times for each time I use index, and
maybe even more.

I happen to think in patterns.  I don't expect other people to do so.  But
because of this, I usually end up picking patterns even if they might be a
little bit slower, because I think the gain in flexibility and especially
maintability more than makes up for any minor performance concerns.

This might also show you why patterns are so important to me: they're one
of the most important tools we have for processing text.  Index isn't,
which is why I really don't care about whether it has O(1) access.  

 To me that suggests that if UTF-8 is used then it may be worth
 profiling to see whether caching the last 2 positions would be
 beneficial.

Notice how with the pattern approach, which is inherently sequential, you don't
have all that concern about running over the string more than once.  Once you
have the first piece (here, "big"), you proceed directly from there looking for
the second piece in a straightforward, WYSIWYG way.  There is no need to keep an
extra index or even two around on the string structure itself, going at it this 
way.

I would be pretty surprised if Perl could gain any speed by caching a pair of
MRU index values against its UTF-8 [but see footnote], because again, I think
the normal access pattern wouldn't make use of them.  Maybe Python programmers
don't think of strings the same way, though.  That, I really couldn't tell you.

But here's something to think about:

If it *is* true that you guys do all this index stuff that Perl programmers
just never see or do because of our differing comfort levels with regexes,
and so you think Python might still benefit from that sort of caching 
because its culture has promoted a different access pattern, then that caching 
benefit would still apply even if you were to retain the current UTF-16 
representation
instead of going to UTF-8 (which might want it) or to UTF-32 (which wouldn't).

After all, you have the same variable-width caching issue with UTF-16 as with
UTF-8, so if it makes sense to have an MRU cache mapping character indices to
byte indices, then it doesn't matter whether you use UTF-8 or UTF-16!

However, I'd want some passive comparative benchmarks using real programs with
real data, because I would be suspicious of incurring the memory cost of two
more pointers in every string in the whole program.  That's serious.

--tom

FOOTNOTE: The Perl 6 people are thinking about clever ways to set up byte
  offset indices.  You have to do this if you want O(1) access to the
  Nth element for elements that are not simple code points even if you
  use UTF-32.  That's because they want the default string element to be
  a user visible grapheme, not a code point.  I know they have clever

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote
   on Sat, 13 Aug 2011 21:09:52 -: 

 And/or a lookup table giving the byte offset of, say, every 16th
 character. It gives you a O(1) lookup with a relatively reasonable
 constant cost (you have to scan for less than 16 characters after the
 lookup).

 On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
 table would be 1/16. It could also be constructed lazily whenever more
 than 2 positions are cached.
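
A sketch of that stride table over a UTF-8 buffer, purely to show the
O(1)-plus-short-scan shape; the names are made up and the stride of 16 is just
the number from the quoted proposal (char_at assumes n is a valid index):

    STRIDE = 16

    def build_offsets(buf):
        # Byte offset of every 16th code point in a UTF-8 bytes object.
        offsets, cp = [], 0
        for i, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:       # lead byte: a new code point starts here
                if cp % STRIDE == 0:
                    offsets.append(i)
                cp += 1
        return offsets

    def char_at(buf, offsets, n):
        # Jump to the nearest recorded offset, then scan at most 15 characters.
        i = offsets[n // STRIDE]
        for _ in range(n % STRIDE):
            i += 1
            while buf[i] & 0xC0 == 0x80:
                i += 1
        j = i + 1
        while j < len(buf) and buf[j] & 0xC0 == 0x80:
            j += 1
        return buf[i:j].decode("utf-8")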

You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you.  Larry explained it to me once but I no longer recall any details.

I notice though that they don't seem to think it worth doing for UTF-8 
or UTF-16, just for their synthetic NFG (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
 Perhaps someone could tell me why the Python documentation says it uses
 UCS-2 on a narrow build.

 There's a disagreement on that point between several developers. 
 See an example sub-thread at:

   http://mail.python.org/pipermail/python-dev/2010-November/105751.html

Some of those folks know what they're talking about, and some do not.

Most of the postings miss the mark.

Python uses UTF-16 for its narrow builds.  It does not use UCS-2.

The argument that it must be UCS-2 because it can store lone surrogates
in memory is spurious.

You have to read The Unicode Standard very *very* closely, but it is not
necessary that all internal buffers always be in well-formed UTF-whatever.
Otherwise it would be impossible to append a code unit at a time to a buffer.
I could pull out the reference if I worked at it, because I've had to find
it before.  It's in there.  Trust me.  I know.

It is also spurious to pretend that because you can produce illegal output
when telling it to generate something in UTF-16 that it is somehow not using
UTF-16.  You have simply made a mistake.  You have generated something  that
you have promised you would not generate.   I have more to say about this below.

Finally, it is spurious to argue against UTF-16 because of the code unit
interface.  Java does exactly  the same thing as Python does *in all regards*
here, and no one pretends that Java is UCS-2.  Both are UTF-16.

It is simply a design error to pretend that the number of characters
is the number of code units instead of code points.  A terrible and
ugly one, but it does not mean you are UCS-2.

You are not.  Python uses UTF-16 on narrow builds.  

The ugly terrible design error is digusting and wrong, just as much in 
Python as in Java, and perhaps moreso because of the idiocy of narrow
builds even existing.  But that doesn't make them UCS-2.

If I could wave a magic wand, I would have Python undo its code unit
blunder and go back to code points, no matter what.  That means to stop
talking about serialization schemes and start talking about logical code
points.  It means that slicing and index and length and everything only
report true code points.  This horrible code unit botch from narrow builds
is most easily cured by moving to wide builds only.

However, there is more.

I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken
in a bunch of ways.  You should be raising an exception in all kinds of
places and you aren't.  I can see I need to bug report this stuff too.  
I don't mean to be mean about this.  HONEST!  It's just the way it is.

Unicode currently reserves 66 code points as noncharacters, which it 
guarantees will never be in a legal UTF-anything stream.  I am not talking 
about surrogates, either.

To start with, no code point which when bitwise ANDed with 0xFFFE returns
0xFFFE can ever appear in a valid UTF-* stream, but Python allows these
without any error.

That means that both 0xNN_FFFE and 0xNN_ are illegal in all planes,
where NN is 00 through 10 in hex.  So that's 2 noncharacters times 17 
planes = 34 code points illegal for interchange that Python is passing 
through illegally.  

The remaining 32 nonsurrogate code points illegal for open interchange
are 0xFDD0 through 0xFDEF.  Those are not allowed either, but Python
doesn't seem to care.
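
The full set is small enough to enumerate, which makes the missing check all
the more conspicuous; a sketch of the predicate (the function name is made up):

    def is_noncharacter(cp):
        # The 66 Unicode noncharacters: U+FDD0..U+FDEF plus the last two code
        # points of every plane (U+FFFE/U+FFFF up through U+10FFFE/U+10FFFF).
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    assert sum(is_noncharacter(cp) for cp in range(0x110000)) == 66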

You simply cannot say you are generating UTF-8 and then generate a byte
sequence that UTF-8 guarantees can never occur.  This is a violation.

***SIGH***

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11230] Full unicode import system not in 3.2

2011-08-12 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Whoops, I meant that it appears that Python runs its identifiers through NFC.  
How that gets along with a filesystem that has quasi-NFD filenames I'm not 
sure, but it seems like it might be a variant of the case-insensitivity issue 
in filenames.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11230
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-12 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Terry J. Reedy tjre...@udel.edu added the comment:

 I am not sure that everyone will agree that this is a bug, rather than a
 feature request, or that if a bug, that it should be changed in existing
 releases and possibly break running code. The doc just says, somewhat vaguely,
 that IGNORECASE works for Unicode characters as expected. I have added
 others as nosy for their opinions.

Working as expected for Unicode characters means it must follow Unicode's
rules for casefolding.  Otherwise you don't have Unicode at all; you just 
have ISO 10646.  Unicode is not merely a larger character repertoire; again,
that is merely ISO 10646.  Unicode is all about the rules for processing this
larger repertoire.  This is a very common mistake, so common that it is in the 
Unicode FAQ:

Q: What is the relation between ISO/IEC 10646 and Unicode?

A: In 1991, the ISO Working Group responsible for ISO/IEC 10646 (JTC
   1/SC 2/WG 2) and the Unicode Consortium decided to create one
   universal standard for coding multilingual text. Since then, the
   ISO 10646 Working Group (SC 2/WG 2) and the Unicode Consortium
   have worked together very closely to extend the standard and to
   keep their respective versions synchronized. [EH]

Q: So are they the same thing?

A: No. Although the character codes and encoding forms are
   synchronized between Unicode and ISO/IEC 10646, the Unicode
   Standard imposes additional constraints on implementations to
   ensure that they treat characters uniformly across platforms and
   applications. To this end, it supplies an extensive set of
   functional character specifications, character data, algorithms
   and substantial background material that is *not* in ISO/IEC 10646.

http://unicode.org/faq/unicode_iso.html

Part of those functional character specifications can be found in the three
casefolding fields of the file UnicodeData.txt and also in two auxiliary
files of the Unicode distribution, CaseFolding.txt and SpecialCasing.txt.
The Unicode Character Database is not optional.  If you do not use it, you
do not have Unicode; instead you merely have ISO 10646, which is of zero
practical use to anyone compared with Unicode.  I'm sure that Python would
not want to be stuck having something of no use to anyone when everyone
else actually supports Unicode.

One is not allowed to make up one's own rules that run counter to Unicode's
and still make the claim that one is working on Unicode, since that is in
fact not what one is doing.  Based on all that, Python does not do case
insensitive matching on Unicode, a condition contrary to its documented
claims.  That clearly makes it a bug that needs fixing rather than a 
feature request to be summarily ignored.

 The test file should have omitted the gratuitous and distracting warnings,
 especially the one that effectively scolds Windows users for running
 Windows. With those omitted, the test cases given would form the basis for
 an added TestCase.

I have absolutely no idea what on earth you could possibly be referring to.
Honestly.  I ran my tests on both releases (2.7 and 3.2), on both builds
(wide and narrow), and on both platforms (Unix and Mac).  The warnings are
in there so I can make sure I have everything set up correctly to run the 
tests, and will understand why I get more failures than expected in the event 
that things are not set up appropriately.

Let me make perfectly clear that I have never in my life come anywhere near a
Microsoft system, let alone touched one, and that I furthermore never shall.  
I have not the foggiest notion what in the world you are complaining about.
If the problem is that you are for some reason unable to create a Python with
full Unicode support under Microsoft, that is hardly my fault.   Render unto
Caesar that which is Caesar's: complain to Microsoft about Microsoft's bugs,
not to me, as I am wholly blameless of their problems.

If you don't like my test cases, you know where to find vi.  

I suppose I could always send you the program that writes these programs
for me, but as I knew you wouldn't like it, I withheld it.  You already have
all that you need to see exactly where the bugs are and how to fix them.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12728
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-12 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy rep...@bugs.python.org wrote
   on Fri, 12 Aug 2011 22:21:59 -: 

 Does the regex module handle these particular issues better?

No, it currently does not.  One would have to ask Matthew directly, but I
believe it was because he was trying to stay compatible with re, sometimes
apparently even if that means being bug compatible.  I have brought it 
to his attention though, and at last report he was pondering the matter.

In contrast to how Python behaves on narrow builds, even though Java uses
UTF-16 for its internal representation of strings, its Pattern class is
quite adamant about dealing in logical code points alone.  Doing otherwise
not only runs afoul of tr18, it is senseless.  A dot is one
Unicode code point, no matter whether you have 8-bit code units, 16-bit
code units, or 32-bit code units.  Similarly, character classes and their
negations only match entire code points, never pieces of the same.

ICU's regexes work the same way the normal Java Pattern library does.  
So too do Perl, Ruby, and Go.  Python is really the odd man out here.  

Almost.

One interesting counterexample is the vim editor.  It makes dot match a
complete grapheme no matter how many code points that requires, because
we're dealing with user-visible characters now, not programmer-visible ones.

It is an unreasonable burden to make the programmer deal with the
fine-grained details of low-level serialization schemes instead of at
least(*) the code point level of operations, which is the minimum for
getting real work done.  (*Note that tr18 admits that accessing text at the
code point level meets only programmer expectations, not those of the user,
and therefore to meet user expectations much more elaborate patterns must
necessarily be constructed than if logical groups of coarser granularity
than code points alone are supported.)

Python should not be subject to changing its behavior from one build to the
next.  This astonishing narrow-vs-wide build behavior makes it virtually
impossible to write portable code to work on arbitrary Unicode text. You
cannot even know whether you need to match one dot or two to get a single
code point, and similarly for character indexing, etc. Even identifiers
come into play.  Surrogates should be utterly nonexistent/invisible at
this, the normal level of operation.  

An API that minimally but uniformly deals with logical code points and
nothing finer in granularity is the only way to go here.  Please trust me
on this one.  Graphemes (tr18 Level 2) and collation elements (Level 3)
will someday build on that, but one must first support code points
properly. That's why it's a Level 1 requirement.

--tom

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-12 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

 Terry J. Reedy tjre...@udel.edu added the comment:

 However desirable it would be, I do not believe there is any claim in the
 manual that the re module follows the evolving Unicode consortium r.e.
 standard.

My from the hip thought is that if re cannot be fixed to follow
the Unicode Standard, it should be deprecated in favor of code
that can if such is available, because you cannot process Unicode
text with regular expressions otherwise.

 If I understand, you are saying that this statement in the doc, Matches
 Unicode word characters, is not now correct and should be revised. Was it
 once correct? Could we add by an older definition of 'word' character?

Yes, your hunch is exactly correct.  They once had a lesser definition than
they have now.  It is very very old.  I had to track this down for Java
once.  There is some discussion of a word_character class at least 
as far back as tr18v3 from back in 1998.

http://www.unicode.org/reports/tr18/tr18-3.html

By the time tr18v5 rolled around just a year later in 1999, the overall
document has changed substantially, and you can clearly see its current
shape there.  Word characters are supposed to include all code points with
the Alphabetic property, for example.  

http://www.unicode.org/reports/tr18/tr18-5.html

However, the word alphabetic has *never* been synonymous in 
Unicode with 

\p{gc=Lu}
\p{gc=Ll}
\p{gc=Lt}
\p{gc=Lm}
\p{gc=Lo}

as many people incorrectly assume, nor certainly to 

\p{gc=Lu}
\p{gc=Ll}
\p{gc=Lt}

let alone to 

\p{gc=Lu}
\p{gc=Ll}

Rather, it has since its creation included code points that are not
letters, such as all GC=Nl and also certain GC=So code points.  And,
notoriously, U+0345. Indeed it is here that I first noticed that Python had
already broken with the Standard, because U+0345 COMBINING GREEK
YPOGEGRAMMENI is GC=Mn, but Alphabetic=True, yet I have shown that 
Python's title method is messing up there.  

I wouldn't spend too much time on archaeological digs, though, because lots of
stuff has changed since the last millennium.  It was in tr18v7 from 2003-05
that we hit paydirt, because this is when the famous Annex C of RL1.2a 
fame first appeared:

http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties

Notice how it defines \w to be nothing more than \p{alpha}, \p{digit}, and
\p{gc=Pc}.  It does not yet contain the requirement that all Marks be
counted as part of the word, just the few that are alphas -- which the
U+0345 counts for, since it has an uppercase map of a capital iota!

That particular change did not occur until tr18v8 in 2003-08, barely
a scant three months later.

http://www.unicode.org/reports/tr18/tr18-8.html#Compatibility_Properties

Now at last we see word characters defined in the modern way that we 
have become used to.  They must match any of:

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}

BTW, Python is matching  all of 

\p{GC=N}

meaning

\p{GC=Nd}
\p{GC=Nl}
\p{GC=No}

instead of the required 

\p{GC=Nd}

which is a synonym for \p{digit}.

I don't know how that happened, because \w has never included
all number code points in Unicode, only the decimal number ones.
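
To make the comparison concrete, here is a rough sketch of the Annex C
definition of \w, written against Matthew Barnett's regex module; I am
assuming its \p{...} property syntax here, so take it as illustrative
rather than definitive:

    import regex   # third-party module, not the stdlib re

    # RL1.2a word character: Alphabetic, any Mark, decimal digit, or
    # Connector_Punctuation.
    WORD = regex.compile(r'[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]')

    assert WORD.match('\u0345')      # COMBINING GREEK YPOGEGRAMMENI: a Mark
    assert not WORD.match('\u00BD')  # VULGAR FRACTION ONE HALF: GC=No, not \w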

That all goes to show why, when citing conformance to some aspect of 
The Unicode Standard, one must be exceedingly careful just how one 
does so!
The Unicode Consortium recognizes this is an issue, and I am pretty
sure I can hear it in your own subtext as well.  

Kindly bear with and forgive me for momentarily sounding like a standards
lawyer.  I do this to show not just why it is important to get
references to the Unicode Standard correct, but indeed, how to do so.

After I have given the formal requirements, I will then produce
illustrations of various purported claims, some of which meet the
citation requirements, and others which do not.

===

To begin with, there is an entire technical report on conformance.
It includes:

http://unicode.org/reports/tr33/

The Unicode Standard [Unicode] is a very large and complex standard.
Because of this complexity, and because of the nature and role of the
standard, it is often rather difficult to determine, in any particular
case, just exactly what conformance to the Unicode Standard means.

...

Conformance claims must be specific to versions of the Unicode
Standard, but the level of specificity needed for a claim may vary
according to the nature of the particular conformance claim. Some
standards developed by the Unicode Consortium require separate
conformance to a specific version (or later), of the Unicode Standard.
This version is sometimes called the  base version. In such cases, the
version of the standard and the version of the Unicode Standard to
which the conformance claim

[issue12732] Can't portably use Unicode in Python identifiers

2011-08-12 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Terry J. Reedy rep...@bugs.python.org wrote
   on Fri, 12 Aug 2011 23:05:27 -: 

 Ouch!

 Do the rejected characters qualify as identifier characters as defined
 in Reference 2.3 Identifiers and keywords?

 http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers

Yes, that's right, they do.  You're using the standard IDS and IDC, and
XIDS and XIDC, definitions.  Here were the three identifiers that were
a problem:

픘픫픦픠픬픡픢 = super
ДЯхШщЯл = Deseret
̰̰̈́̈́‿̰̿̽̓͂  = Gothic our father

If you cannot read those, then when piped through `uniquote -v` they are:

\N{MATHEMATICAL FRAKTUR CAPITAL U}\N{MATHEMATICAL FRAKTUR SMALL 
N}\N{MATHEMATICAL FRAKTUR SMALL I}\N{MATHEMATICAL FRAKTUR SMALL 
C}\N{MATHEMATICAL FRAKTUR SMALL O}\N{MATHEMATICAL FRAKTUR SMALL 
D}\N{MATHEMATICAL FRAKTUR SMALL E} = super
\N{DESERET CAPITAL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}\N{DESERET 
SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}\N{DESERET SMALL LETTER 
ER}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER TEE} = Deseret
\N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER TEIWS}\N{GOTHIC LETTER 
TEIWS}\N{GOTHIC LETTER AHSA}\N{UNDERTIE}\N{GOTHIC LETTER URUS}\N{GOTHIC LETTER 
NAUTHS}\N{GOTHIC LETTER SAUIL}\N{GOTHIC LETTER AHSA}\N{GOTHIC LETTER RAIDA}  = 
Gothic our father

I'm not sure whether you recognize the scripts they belong to, but they're
all in the astral planes.  Using `uniquote -x` on them shows:

\x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522} = 
super
\x{10414}\x{1042F}\x{10445}\x{10428}\x{10449}\x{1042F}\x{1043B} = 
Deseret

\x{10330}\x{10344}\x{10344}\x{10330}\x{203F}\x{1033F}\x{1033D}\x{10343}\x{10330}\x{10342}
  = Gothic our father

As to whether they're proper identifiers per your reference above, I
will take the first letter from each of 픘픫픦픠픬픡픢, ДЯхШщЯл, and ̰̰̈́̈́‿̰̿̽̓͂, 
which are respectively 픘, Д, and ̰, or

MATHEMATICAL FRAKTUR CAPITAL U
DESERET CAPITAL LETTER DEE
GOTHIC LETTER AHSA

or 

1D518
10414
10330

and show you the full Unicode properties of these rejected code points.
This requires the uniprops command, given which, these three commands 
are then completely identical:

% uniprops -ga 픘 Д ̰
% uniprops -ga 1D518 10414 10330
% uniprops -ga MATHEMATICAL FRAKTUR CAPITAL U DESERET CAPITAL LETTER 
DEE GOTHIC LETTER AHSA

and produce this output:

U+1D518 ‹픘› \N{MATHEMATICAL FRAKTUR CAPITAL U}
\w \pL \p{LC} \p{L_} \p{L} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols 
Cased Cased_Letter LC Changes_When_NFKC_Casefolded
   CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue 
IDC ID_Start IDS Letter L_ Uppercase_Letter Math
   Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word 
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
   X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L 
Block=Mathematical_Alphanumeric_Symbols Canonical_Combining_Class=0
   Canonical_Combining_Class=Not_Reordered CCC=NR 
Canonical_Combining_Class=NR General_Category=Cased_Letter Script=Common
   Decomposition_Type=Font DT=Font Decomposition_Type=Non_Canon 
Decomposition_Type=Non_Canonical DT=NonCanon
   East_Asian_Width=Neutral GC=LC General_Category=L 
General_Category=Letter General_Category=L_ General_Category=LC GC=L
   General_Category=Lu General_Category=Uppercase_Letter GC=Lu 
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
   Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA 
Joining_Group=No_Joining_Group JG=NoJoiningGroup
   Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL 
Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
   Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 
Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
   Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy
   Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE 
Word_Break=LE _X_Begin
U+10414 ‹Д› \N{DESERET CAPITAL LETTER DEE}
\w \pL \p{LC} \p{L_} \p{L} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC 
Changes_When_Casefolded CWCF Changes_When_Casemapped
   CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF 
Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase
   ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper 
Uppercase Word XID_Continue XIDC XID_Start XIDS
   X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper 
X_POSIX_Word
Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret 
Canonical_Combining_Class=0
   Canonical_Combining_Class=Not_Reordered CCC=NR 
Canonical_Combining_Class=NR General_Category=Cased_Letter

[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

The Python re library is broken in its approach to case-insensitive matches. It 
erroneously attempts to compare lowercase mappings.  This is wrong. You must 
compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get 
wrong answers.  I include a small test case that illustrates this bug.  The bug 
exists on both 2.7 and 3.2, and on both wide builds and narrow builds.  For 
comparison, I also show results using Matthew Barnett's regex library, which 
gets all 5 tests correct where re gets all 5 tests wrong.

A sample run is:

FAIL: re pattern Ι is not the same as string ͅ
PASS: regex pattern Ι is indeed the same as string ͅ
FAIL: re pattern Μ is not the same as string µ
PASS: regex pattern Μ is indeed the same as string µ
FAIL: re pattern ſ is not the same as string s
PASS: regex pattern ſ is indeed the same as string s
FAIL: re pattern ΣΤΙΓΜΑΣ is not the same as string στιγμας
PASS: regex pattern ΣΤΙΓΜΑΣ is indeed the same as string στιγμας
FAIL: re pattern POST is not the same as string poſt
PASS: regex pattern POST is indeed the same as string poſt

re lib passed 0 of 5 tests
regex lib passed 5 of 5 tests
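
For what it is worth, here is a tiny sketch of the underlying difference.
It uses str.casefold(), which only appeared in Python 3.3 and so postdates
this report; treat it purely as an illustration:

    pairs = [('\u017F', 's'),        # LATIN SMALL LETTER LONG S vs s
             ('\u00B5', '\u03BC'),   # MICRO SIGN vs GREEK SMALL LETTER MU
             ('POST',   'po\u017Ft')]

    for a, b in pairs:
        assert a.lower() != b.lower()        # casemap comparison: wrong answer
        assert a.casefold() == b.casefold()  # casefold comparison: right answer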

--
components: Library (Lib)
files: sigmata.python
messages: 141916
nosy: tchrist
priority: normal
severity: normal
status: open
title: Python re lib fails case insensitive matches on Unicode data
versions: Python 2.7
Added file: http://bugs.python.org/file22879/sigmata.python

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12728
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Python is in flagrant violation of the very most basic premises of Unicode 
Technical Report #18 on Regular Expressions, which requires that a regex engine 
support Unicode characters as basic logical units independent of serialization 
like UTF‑*.  Because sometimes you must specify .. to match a single Unicode 
character -- whenever those code points are above the BMP and you are on a 
narrow build -- Python regexes cannot be reliably used for Unicode text.

 % python3.2
 Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import re
 >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
 >>> print(g)
 ᾲ
 >>> print(re.search(r'\w', g))
 <_sre.SRE_Match object at 0x10051f988>
 >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
 >>> print(p)
 풫
 >>> print(re.search(r'\w', p))
 None
 >>> print(re.search(r'..', p))   # ← 홏홃홄홎 홄홎 홏홃홀 홑홄홊홇혼홏홄홊홉 홍홄홂홃홏 홃홀홍홀
 <_sre.SRE_Match object at 0x10051f988>
 >>> print(len(chr(0x1D4AB)))
 2

That is illegal in Unicode regular expressions.
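
To show the contortions this forces on people, here is a rough sketch of
what matching one code point portably looks like today; this is my own
workaround, not a sanctioned API:

    import re, sys

    if sys.maxunicode < 0x10FFFF:
        # Narrow build: an astral code point is stored as a surrogate pair,
        # so "one code point" must be spelled "surrogate pair OR any unit".
        ANY_CODE_POINT = re.compile('[\uD800-\uDBFF][\uDC00-\uDFFF]|.', re.DOTALL)
    else:
        # Wide build: any single character will do.
        ANY_CODE_POINT = re.compile('.', re.DOTALL)

    # On a wide build the match is always one character; on a narrow build
    # it may be two code units -- which is exactly the inconsistency above.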

--
components: Regular Expressions
messages: 141917
nosy: tchrist
priority: normal
severity: normal
status: open
title: Python lib re cannot handle Unicode properly due to narrow/wide bug
type: behavior
versions: Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

You cannot use Python's casemapping functions on Unicode data because they fail 
on narrow builds.  This makes it impossible to write portable code in Python 
that can cope with full Unicode.

I've tried several times to submit this bug, but the file selection widget 
blows up. I believe it was an Opera bug because I had a write lock on the file. 
 One more time.

--
components: Unicode
files: casemaps.python
messages: 141918
nosy: tchrist
priority: normal
severity: normal
status: open
title: Python's casemapping functions are untrustworthy due to narrow/wide 
build issues
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file22880/casemaps.python

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12730
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

You cannot use Python's lib re for handling Unicode regular expressions because 
it violates the standard set out for the same in UTS#18 on Unicode Regular 
Expressions in RL1.2a on compatibility properties.  What \w is allowed to match 
is clearly explained there, but Python has its own idea. Because it is in clear 
violation of the standard, it is misleading and wrong for Python to claim that 
the re.UNICODE flag makes \w and friends match Unicode.  Here are the failed 
test cases when the attached file is run under v3.2; there are further failures 
when run under v2.7.

FAIL lib re found non alphanumeric string café
FAIL lib re found non alphanumeric string Ⓚ
FAIL lib re found non alphanumeric string ͅ
FAIL lib re found non alphanumeric string ְ
FAIL lib re found non alphanumeric string ퟘ
FAIL lib re found non alphanumeric string ́
FAIL lib re found non alphanumeric string 픘픫픦픠픬픡픢
FAIL lib re found non alphanumeric string ДЯхШщЯл
FAIL lib re found non alphanumeric string connector‿punctuation
FAIL lib re found non alphanumeric string Ὰͅ_Στο_Διάολο
FAIL lib re found non alphanumeric string ̰̰̈́̈́‿̰̿̽̓͂‿̸̿‿̹̽‿̷̹̼̹̰̼̽
FAIL lib re found all alphanumeric string ¹²³
FAIL lib re found all alphanumeric string ₁₂₃
FAIL lib re found all alphanumeric string ¼½¾
FAIL lib re found all alphanumeric string ⑶

Note that Matthew Barnett's regex lib for Python handles all of these cases in 
conformance with The Unicode Standard.

--
components: Regular Expressions
files: alnum.python
messages: 141920
nosy: tchrist
priority: normal
severity: normal
status: open
title: python lib re uses obsolete sense of \w in full violation of UTS#18 
RL1.2a
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file22881/alnum.python

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12731
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12732] Can't portably use Unicode in Python identifiers

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

You cannot reliably use Unicode in Python identifiers because of the 
narrow/wide build issue.  The enclosed file is fine on wide builds but gets 
compiler errors on narrow ones during compilation.

Go, Ruby, Java, and Perl all handle this situation without any problem; only 
Python has the bug.

--
components: Interpreter Core
files: badidents.python
messages: 141923
nosy: tchrist
priority: normal
severity: normal
status: open
title: Can't portably use Unicode in Python identifiers
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file22882/badidents.python

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12732
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12728] Python re lib fails case insensitive matches on Unicode data

2011-08-11 Thread Tom Christiansen

Changes by Tom Christiansen tchr...@perl.com:


--
components: +Regular Expressions -Library (Lib)
type:  - behavior

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12728
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12733] Request for grapheme support in Python re lib

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Without proper grapheme support in the regular expression library, it is 
impossible to correctly process Unicode.  At the very least, one needs the \X 
escape supported, which matches an extended grapheme cluster per UTS#18.  This 
escape is supported by many regex libraries, including Perl's own and of course 
PCRE (and thence PHP), the standard ICU library, and Matthew Barnett's 
replacement regex library for Python.

How do you process a string by graphemes if you cannot split on \X?  How can 
you avoid splitting a grapheme into silly pieces if you cannot match one?  How 
do I match the letter O no matter what diacritics have been applied to it 
otherwise?  A match of (?=O)\X against an NFD string is by far the simplest and 
best way.

This is necessary for a wide variety of reasons.  Adding \pM and \PM goes a 
little way, but not far enough, because that is not how grapheme clusters are 
defined.  You need a proper \X.
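
For comparison, here is a tiny sketch of the same idea using Matthew
Barnett's regex module, which already provides \X; my example, assuming
an NFD input string:

    import regex, unicodedata

    s = unicodedata.normalize('NFD', '\u00F4\u00D6\u014D')  # ô Ö ō, decomposed
    print(regex.findall(r'\X', s))        # three graphemes, not six code points
    print(regex.findall(r'(?=o)\X', s))   # the two lowercase o's, marks and all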

--
components: Regular Expressions
messages: 141924
nosy: tchrist
priority: normal
severity: normal
status: open
title: Request for grapheme support in Python re lib
type: feature request
versions: Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12733
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12734] Request for property support in Python re lib

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Python supports no Unicode properties in its re library, making it unsuitable 
for work with Unicode. This is therefore a formal request for the Python re 
library to support Unicode properties.

The eleven properties required by Unicode Technical Report #18's RL1.2 are the 
bare minimum which must be added to make it possible to use Python regular 
expressions on Unicode.

The proposed RL2.7 on Full Properties is even better.  That is found at

  http://unicode.org/reports/tr18/proposed.html#Full_Properties

Although by the time you read this, it will have been made an official part of 
tr18.

Matthew Barnett's replacement library for re, called regex, supports 67 Unicode 
properties at last count, including the strongly recommended loose matching.  

The standard re library needs to be spiffed up to make it suitable for Unicode 
processing; it is not currently usable for that due to this missing 
functionality.  I quote from the Level 1 conformance requirement of tr18:

Level 1: This is a minimal level for useful Unicode support. It does not 
account for end-user expectations for character support, but does satisfy most 
low-level programmer requirements. The results of regular expression matching 
at this level are independent of country or language. At this level, the user 
of the regular expression engine would need to write more complicated regular 
expressions to do full Unicode processing.

pass RL1.1 Hex Notation
fail RL1.2 Properties
fail RL1.2a Compatibility Properties
fail RL1.3 Subtraction and Intersection
fail RL1.4 Simple Word Boundaries
fail RL1.5 Simple Loose Matches
fail RL1.6 Line Boundaries
fail RL1.7 Supplementary Code Points

(withdrawn) RL2.1 Canonical Equivalents
fail RL2.2 Extended Grapheme Clusters
fail RL2.3 Default Word Boundaries
fail RL2.4 Default Case Conversion
pass RL2.5 Name Properties
fail RL2.6 Wildcards in Property Values
fail RL2.7 Full Properties

I won’t even talk about Level 3.  

ICU, Perl, and Java7 all meet Level One conformance requirements with several 
Level 2 requirements also met.  It is important for Python to meet the Unicode 
Standard in this so that people can use Python for regex matching Unicode text. 
 They currently cannot usefully do so per the requirements of tr18.
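
For reference, this is roughly what the RL1.2 properties look like in
Matthew Barnett's regex module; I am assuming its property syntax, so
read it as a sketch of the requested feature rather than a specification:

    import regex

    assert regex.search(r'\p{Script=Greek}', '\u03B1\u03B2\u03B3')  # αβγ
    assert regex.search(r'\p{Alphabetic}', '\u0345')  # COMBINING GREEK YPOGEGRAMMENI
    assert not regex.search(r'\p{White_Space}', 'abc')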

--
components: Regular Expressions
messages: 141925
nosy: tchrist
priority: normal
severity: normal
status: open
title: Request for property support in Python re lib
type: feature request
versions: Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12735] request full Unicode collation support in std python library

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Python has no standard support for the Unicode Collation Algorithm (UCA) as 
explained in UTS #10.  This is a request that a UCA library be added to the 
standard Python distribution.

Collation underlies virtually everything we do with text, not just sorting but 
any sort of comparison. Furthermore, the UCA is tailorable for locales in a 
portable way that does not require dodgy vendor support. It is a very important 
step in making Python suitable for full Unicode text processing.
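
As a sketch of what a tailorable UCA interface looks like, here is the
sort of thing one can already do with the third-party PyICU bindings;
this is only an illustration of the request, not a proposed API:

    from icu import Collator, Locale   # third-party PyICU bindings to ICU

    words = ['cote', 'cot\u00E9', 'c\u00F4te', 'c\u00F4t\u00E9']
    collator = Collator.createInstance(Locale('fr_FR'))
    print(sorted(words, key=collator.getSortKey))  # UCA order, French tailoring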

--
components: Library (Lib)
messages: 141926
nosy: tchrist
priority: normal
severity: normal
status: open
title: request full Unicode collation support in std python library
type: feature request
versions: Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12735
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Python's casemapping functions only use what Unicode calls simple casemaps. 
These are only appropriate for functions that operate on single characters 
alone, not for those that operate on strings. The reason for this is that you 
get much better results with full casemapping. Java, Ruby, and Perl all do full 
casemapping for their equivalent functions that do string mapping, and Python 
should, too.
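
To make the distinction concrete, here are a few of the full uppercase
maps straight out of SpecialCasing.txt; a simple one-character-to-one-
character casemap cannot express any of them:

    # character -> full uppercase map
    FULL_UPPER = {
        '\u00DF': 'SS',        # LATIN SMALL LETTER SHARP S
        '\uFB03': 'FFI',       # LATIN SMALL LIGATURE FFI
        '\u0149': '\u02BCN',   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    }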

I include a program that has a bunch of mappings and foldings, both simple and 
full.  Yes, it was machine-generated.

--
components: Library (Lib)
files: mux.python
messages: 141928
nosy: tchrist
priority: normal
severity: normal
status: open
title: Request for python casemapping functions to use full not simple casemaps 
per Unicode's recommendation
type: feature request
versions: Python 3.2
Added file: http://bugs.python.org/file22883/mux.python

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12736
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12737] string.title() is overzealous by upcasing combining marks inappropriately

2011-08-11 Thread Tom Christiansen

New submission from Tom Christiansen tchr...@perl.com:

Python's string.title() function claims it titlecases the first letter in each 
word and lowercases the rest.  However, this is not true.  It is not using 
either of the two word detection algorithms that Unicode provides.  One allows 
you to use a legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, 
or Connector Punctuation (see UTS#18 Annex C: Compatibility Properties), and 
the other uses the more sophisticated word-break provided by the Word_Break 
properties such as Word_Break=MidNumLet.

Python is using neither of these, so gets the wrong answer.

titlecase of déme un café should be Déme Un Café not DéMe Un Café
titlecase of i̇stanbul should be İstanbul not İStanbul
titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο

Because those are in NFD form, you get different answers than if they are in 
NFC.  That is not right. You should get the same answer. The bug is you aren't 
using the right definition for \w, and so get screwed up.  This is likely 
related to issue 12731.
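
A minimal way to reproduce the NFC/NFD discrepancy just described:

    import unicodedata

    s = 'd\u00E9me un caf\u00E9'                      # 'déme un café', in NFC
    print(s.title())                                  # Déme Un Café
    print(unicodedata.normalize('NFD', s).title())    # DéMe Un Café -- the bug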

In the enclosed tester file, which fails 4 out of its 6 tests, there is also a 
bug shown with this failed result:

  titlecase of мЯхШщЯл should be ДЯхШщЯл not мЯхШщЯл

That one is related to issue 12730. 

See the attached tester, which was run under Python 3.2.  As far as I can tell, 
these bugs exist in all python versions.

--
files: titletest.python
messages: 141929
nosy: tchrist
priority: normal
severity: normal
status: open
title: string.title()  is overzealous by upcasing combining marks 
inappropriately
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file22884/titletest.python

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12737
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12734] Request for property support in Python re lib

2011-08-11 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I've been doing a lot of testing of Matthew's regex library against UTS#18 issues, 
but only somewhat incidentally testing re. To use regex, one has to accept that 
certain things will work differently than they work in re, because he is 
following Unicode definitions for things like casefolding. 

But I doubt that is the sort of difference you are talking about. One of the 
things that Java, Go, and Perl all do is run regression tests against the whole 
Unicode Character Database to make sure nothing gets hosed, missed, or 
otherwise out of sync. That might be a sort of regression test you might like to 
add.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12568] Add functions to get the width in columns of a character

2011-08-11 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I can attest that being able to get the columns of a grapheme cluster is very 
important for printing, because you need this to do correct linebreaking.  
There might be something you can steal from 

   http://search.cpan.org/perldoc?Unicode::GCString
   http://search.cpan.org/perldoc?Unicode::LineBreak

which implements UAX#14 on linebreaking and UAX#11 on East Asian widths.  

I use this in my own code to help format Unicode strings by columns or lines.  
The right way would be to build this sort of knowledge into string.format(), 
but that is much harder, so an intermediary library module seems good enough 
for now.
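
Until such a module exists, the crude approximation I see people writing
looks something like this; it ignores most of UAX #14 and the ambiguous
widths of UAX #11, so treat it only as a sketch of the requested feature:

    import unicodedata

    def columns(s):
        # 0 columns for combining marks, 2 for Wide/Fullwidth East Asian
        # characters, 1 for everything else.
        width = 0
        for ch in s:
            if unicodedata.combining(ch):
                continue
            width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        return width

    print(columns('caf\u00E9'), columns('cafe\u0301'))   # 4 4, NFC or NFD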

--
nosy: +tchrist

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12568
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11230] Full unicode import system not in 3.2

2011-08-11 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

How does this work for modules that have filesystem names different from the 
one used for import? The issue I'm thinking about is that the Mac HFS+ 
filesystem keeps its Unicode filenames in (close to) NFD form. That means that 
a module named caf\N{LATIN SMALL LETTER E WITH ACUTE} with 4 graphemes and 4 
code points in its name winds up in the filesystem as cafe\N{COMBINING ACUTE 
ACCENT} still with 4 graphemes but now with 5 code points.

I believe (well, suspect; I have empirical evidence not proof) Python stores 
its own identifiers in NFD, so this may not be quite as much of a problem as it 
might otherwise be.  Nonetheless, I have had users complain about what HFS+ 
does with such filenames, although I am not quite sure why. I think it’s 
because they access a file with 4 chars but they need a 5-char fileglob to 
wildcard it, so touch caf\N{LATIN SMALL LETTER E WITH ACUTE} and then you 
need a glob with an extra ? to find it. Kinda weird.
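
The usual defensive idiom, and a sketch of what the import machinery would
presumably have to do somewhere (my illustration only), is to compare names
normalization-insensitively:

    import os, unicodedata

    def find_entry(directory, name):
        # HFS+ hands back NFD-ish names, so normalize both sides before
        # comparing; caf\N{LATIN SMALL LETTER E WITH ACUTE} then matches
        # its 5-code-point on-disk spelling.
        want = unicodedata.normalize('NFC', name)
        for entry in os.listdir(directory):
            if unicodedata.normalize('NFC', entry) == want:
                return entry
        return None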

--
nosy: +tchrist

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11230
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2857] add codec for java modified utf-8

2011-08-11 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Please do not call this utf-8-java. It is called cesu-8 per TR #26 at:

  http://unicode.org/reports/tr26/

CESU-8 is *not* a valid Unicode Transformation Format and should not be called 
UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode 
mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need to be able 
to read it, but call it what it is, please.
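
A tiny illustration of the difference, using U+10400 as the guinea pig
(my own example):

    # In CESU-8 the UTF-16 surrogates D801 DC00 are each pushed through the
    # UTF-8 encoder separately, instead of encoding the code point itself.
    utf8  = b'\xf0\x90\x90\x80'            # genuine UTF-8 for U+10400
    cesu8 = b'\xed\xa0\x81\xed\xb0\x80'    # CESU-8 for the same character

    assert '\U00010400'.encode('utf-8') == utf8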

Despite the talk about Lucene, I note that the Perl port of Lucene uses real 
UTF-8, not CESU-8.

--
nosy: +tchrist

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2857
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com