Re: [Python-Dev] Python and the Unicode Character Database

2010-12-07 Thread Vlastimil Brom
2010/12/7 Alexander Belopolsky alexander.belopol...@gmail.com:
 On Sat, Dec 4, 2010 at 5:58 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 I actually wonder if Python's re module can claim to provide even
 Basic Unicode Support.

 Do you really wonder? Most definitely it does not.


 Were you more optimistic four years ago?

 http://bugs.python.org/issue1528154#msg54864

 I was hoping that regex syntax would be useful in
 explaining/documenting Python text processing routines (including
 string to number conversions) without a heavy dose of Unicode
 terminology.


The new regex version
http://bugs.python.org/issue2636
supports many more features, including Unicode properties and the
POSIX classes mentioned above, but definitely not all of the
requirements of that rather generous list:
http://www.unicode.org/reports/tr18/
It seems that Perl, for example, has some omissions too:
http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

Do you know of any re engine fully complying with tr18, even at the
first level: Basic Unicode Support?

vbr
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-07 Thread Alexander Belopolsky
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom vlastimil.b...@gmail.com wrote:
..
 It seems, e.g. in Perl, there are some omissions too
 http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

 Do you know of any re engine fully complying with tr18, even at the
 first level: Basic Unicode Support?

I would say Perl comes very close.  At least it explicitly documents
the missing features and offers workarounds in its reference manual.
I am actually not as concerned about missing features as I am about
non-conformance in the widely used features such as digits' matching
with '\d'.


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-07 Thread Martin v. Löwis
On 07.12.2010 04:03, Alexander Belopolsky wrote:
 On Sat, Dec 4, 2010 at 5:58 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 I actually wonder if Python's re module can claim to provide even
 Basic Unicode Support.

 Do you really wonder? Most definitely it does not.

 
 Were you more optimistic four years ago?
 
 http://bugs.python.org/issue1528154#msg54864

Not at all. I thought back then, and think now, that Python should,
but doesn't, support TR#18. I don't view that lack as a severe problem,
though, and apparently none of the other contributors did so, either.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-07 Thread Alexander Belopolsky
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom vlastimil.b...@gmail.com wrote:
..
 Do you know of any re engine fully complying with tr18, even at the
 first level: Basic Unicode Support?


ICU Regular Expressions conform to Unicode Technical Standard #18,
Unicode Regular Expressions, level 1, and in addition include Default
Word boundaries and Name Properties from level 2.
 http://userguide.icu-project.org/strings/regexp


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-07 Thread Vlastimil Brom
2010/12/7 Alexander Belopolsky alexander.belopol...@gmail.com:
 On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom vlastimil.b...@gmail.com 
 wrote:
 ..
 Do you know of any re engine fully complying with tr18, even at the
 first level: Basic Unicode Support?

 
 ICU Regular Expressions conform to Unicode Technical Standard #18,
 Unicode Regular Expressions, level 1, and in addition include Default
 Word boundaries and Name Properties from level 2.
  http://userguide.icu-project.org/strings/regexp


Thanks for the pointer, I wasn't aware of that project.
Anyway, I am quite happy with the regex library mentioned above and can
live with trading full compliance for some non-Unicode goodies (like
unbounded lookbehinds etc.), but I see that this is beside the point here.
Not that my opinion matters, but I can't think of, say, union,
intersection and set-difference of Unicode sets as a basic feature, or
consider it part of a minimal level of useful Unicode support.

vbr


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-06 Thread Alexander Belopolsky
On Sat, Dec 4, 2010 at 5:58 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 I actually wonder if Python's re module can claim to provide even
 Basic Unicode Support.

 Do you really wonder? Most definitely it does not.


Were you more optimistic four years ago?

http://bugs.python.org/issue1528154#msg54864

I was hoping that regex syntax would be useful in
explaining/documenting Python text processing routines (including
string to number conversions) without a heavy dose of Unicode
terminology.


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-04 Thread Stephen J. Turnbull
Antoine Pitrou writes:
  On Friday, 3 December 2010 at 13:58 +0900, Stephen J. Turnbull
  wrote:
   Antoine Pitrou writes:
   
 The legacy format argument looks like a red herring to me. When
 converting from one format to another it is the programmer's job to
 do his/her job right.
   
   Uhmm, the argument *for* this feature proposed by several people
   is that Python's numeric constructors do it (right) so that the
   programmer doesn't have to.
  
  As far as I understand, Alexander was talking about a legacy pre-unicode
  text format. We don't have to support this.

*I* didn't say we *should* support it.  I'm saying that *others'*
argument for not restricting the formats accepted by string to number
converters to something well-defined and AFAIK universally understood
by users (developers of Python programs *and* end-users) is that we
*already* support this.

Alexander, Martin, and I are basically just pointing out that no, the
support we have via the built-in numeric constructors is incomplete
and nonconforming.  We feel that is a bug to be fixed by (1)
implementing the definition as currently found in the documents, and
(2) moving the non-ASCII support to another module (or, as a
compromise, supporting non-ASCII digits via an argument to the
built-ins -- that was my proposal, I don't know if Alexander or Martin
would find it acceptable).

Given that some committers (MAL, you?) don't even consider that
accepting and converting a string containing digits from multiple
scripts as a single number is a bug, I'd really rather that this
bug/feature not be embedded in the interpreter.  I suppose that as a
built-in rather than syntax, technically it doesn't fall under the
moratorium, but it makes me nervous.


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-04 Thread Antoine Pitrou
On Saturday, 4 December 2010 at 17:13 +0900, Stephen J. Turnbull wrote:
 Antoine Pitrou writes:
   On Friday, 3 December 2010 at 13:58 +0900, Stephen J. Turnbull
   wrote:
Antoine Pitrou writes:

  The legacy format argument looks like a red herring to me. When
  converting from one format to another it is the programmer's job to
  do his/her job right.

Uhmm, the argument *for* this feature proposed by several people
is that Python's numeric constructors do it (right) so that the
programmer doesn't have to.
   
   As far as I understand, Alexander was talking about a legacy pre-unicode
   text format. We don't have to support this.
 
 *I* didn't say we *should* support it.  I'm saying that *others'*
 argument for not restricting the formats accepted by string to number
 converters to something well-defined and AFAIK universally understood
 by users (developers of Python programs *and* end-users) is that we
 *already* support this.

As far as I can parse your sentence, I think you are mistaken.

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-04 Thread Alexander Belopolsky
On Fri, Dec 3, 2010 at 12:10 AM, Alexander Belopolsky
alexander.belopol...@gmail.com wrote:
..
 I don't think decimal module should support non-European decimal
 digits.  The only place where it can make some sense is in int()
 because here we have a fighting chance of producing a reasonable
 definition.   The motivating use case is conversion of numerical data
 extracted from text using simple '\d+'  regex matches.


It turns out, this use case does not quite work in Python either:

>>> re.compile(r'\s+(\d+)\s+').match(' \u2081\u2082\u2083   ').group(1)
'₁₂₃'
>>> int(_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u2081' in
position 0: invalid decimal Unicode string

This may actually be a bug in Python's regex implementation, because
the Unicode standard seems to recommend that '\d' be interpreted as gc =
Decimal_Number (Nd):

http://unicode.org/reports/tr18/#Compatibility_Properties

I actually wonder if Python's re module can claim to provide even
Basic Unicode Support.

http://unicode.org/reports/tr18/#Basic_Unicode_Support
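
To make the mismatch concrete, here is a minimal sketch using only the standard unicodedata module. It assumes a current Python 3, whose re module may behave differently from the development build in the session above; the sample characters and the helper name is_nd are my own choices for illustration.

```python
import unicodedata

# SUBSCRIPT ONE (U+2081) carries a digit value but is not a decimal digit,
# while ARABIC-INDIC DIGIT ONE (U+0661) is a true decimal digit (gc=Nd).
for ch in ('\u2081', '\u0661', '1'):
    print(
        'U+%04X' % ord(ch),
        unicodedata.category(ch),
        ch.isdigit(),
        ch.isdecimal(),
    )


def is_nd(ch):
    """True if ch has General_Category Nd, the TR18 reading of \\d."""
    return unicodedata.category(ch) == 'Nd'
```

Anything for which is_nd() is false would fall outside the TR18 definition of '\d', whatever str.isdigit() says about it.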

 Here is how I would do it:

 1.  String x of non-European decimal digits is only accepted in
 int(x), but not by int(x, 0) or int(x, 10).
 2.  If x contains one or more non-European digits, then

    (a)  all digits must be from the same block:

      def basepoint(c):
          return ord(c) - unicodedata.digit(c)
      all(basepoint(c) == basepoint(x[0]) for c in x) -> True

    (b) and a '+' or '-' sign is not allowed.

 3. A character c is a digit if it matches '\d' regex.  I think this
 means unicodedata.category(c) -> 'Nd'.

 Condition 2(b) is important because there is no clear way to define
 what is acceptable as '+' or '-' using Unicode character properties
 and not all number systems even support local form of negation.  (It
 is also YAGNI.)
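
The three rules can be sketched roughly as follows. This assumes Python 3.7 or later (for str.isascii), and strict_int is a hypothetical name used only for illustration, not a proposed builtin. Strings made up purely of ASCII characters are handed to the ordinary int() unchanged.

```python
import unicodedata


def basepoint(c):
    # Code point of the zero digit in c's run of ten decimal digits.
    return ord(c) - unicodedata.digit(c)


def strict_int(x):
    """Hypothetical helper following rules 1-3; not a Python builtin."""
    if x.isascii():
        return int(x)  # European digits: ordinary int() rules apply
    # Rule 3 and rule 2(b): every character must be gc=Nd, so no signs.
    if any(unicodedata.category(c) != 'Nd' for c in x):
        raise ValueError('non-digit character in %r' % x)
    # Rule 2(a): all digits must share one basepoint, i.e. one digit set.
    if any(basepoint(c) != basepoint(x[0]) for c in x):
        raise ValueError('mixed digit sets in %r' % x)
    return int(x)
```

With this, strict_int('\u0661\u0662\u0663') gives 123, while a string mixing ASCII and Arabic-Indic digits, or carrying a sign in front of non-European digits, is rejected.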



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-04 Thread Martin v. Löwis
 I actually wonder if Python's re module can claim to provide even
 Basic Unicode Support.

Do you really wonder? Most definitely it does not.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-03 Thread Neil Hodgson
Stephen J. Turnbull:

 Will it accept Arabic on input?  (Han might be too much to ask for
 since Unicode considers Han digits to be impure.)

   I couldn't find a direct way to input Arabic digits into OO Calc,
the normal use of Alt+number didn't work in Calc although it did in
WordPad where Alt+1632 is ٠ and so on.

   OO Calc does have settings in the Complex Text Layout section for
choosing different numerals but I don't understand the interaction of
choices here.

   Neil


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-03 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 I will change my mind on this issue when you present a
 machine-readable file with Arabic-Indic numerals and a program capable
 of reading it and show that this program uses the same number parsing
 algorithm as Python's int() or float().

 Have you had a look at the examples I posted ? They include texts
 and tables with numbers written using east asian arabic numerals.
 
 Yes, but this was all about output.  I am pretty sure TeX was able to
 typeset Qur'an in all its glory long before Unicode was invented.
 Yet, in machine readable form it would be something like {\quran 1}
 (invented directive).   I have asked for a file that is intended for
 machine processing, not for human enjoyment in print or on a display.
  I claim that if such file exists, the program that reads it does not
 use the same rules as Python and converting non-ascii digits would be
 a tiny portion of what that program does.

Well, programs that take input from the keyboards I posted in this
thread will have to deal with the digits. Since Python's input()
accepts keyboard input, you have your use case :-)

Seriously, I find the distinction between input and output forms
of numerals somewhat misguided. Any output can also serve as input.
For books and other printed material, images, etc. you have scanners
and OCR. For screen output you have screen readers. For spreadsheets
and data, you have CSV, TSV, XML, etc. etc. etc.

Just for the fun of it, I created a CSV file with Thai and Dzongkha
numerals (in addition to Arabic ones) using OpenOffice. Here's the
cut and paste version:


Numbers in various scripts  

Arabic  Thai    Dzongkha
1   ๑   ༡
2   ๒   ༢
3   ๓   ༣
4   ๔   ༤
5   ๕   ༥
6   ๖   ༦
7   ๗   ༧
8   ๘   ༨
9   ๙   ༩
10  ๑๐  ༡༠
11  ๑๑  ༡༡
12  ๑๒  ༡༢
13  ๑๓  ༡༣
14  ๑๔  ༡༤
15  ๑๕  ༡༥
16  ๑๖  ༡༦
17  ๑๗  ༡༧
18  ๑๘  ༡༨
19  ๑๙  ༡༩
20  ๒๐  ༢༠


And here's the script that goes with it:

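import csv

# Python 2.7: csv.reader yields rows of UTF-8 encoded byte strings.
c = csv.reader(open('Numbers-in-various-scripts.csv'))
headers = [c.next() for i in range(3)]  # skip the three header rows
for row in c:
    # Decode each cell, then let int() convert the native digits.
    print [int(unicode(x, 'utf-8')) for x in row]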

and the output using Python 2.7:

[1, 1, 1]
[2, 2, 2]
[3, 3, 3]
[4, 4, 4]
[5, 5, 5]
[6, 6, 6]
[7, 7, 7]
[8, 8, 8]
[9, 9, 9]
[10, 10, 10]
[11, 11, 11]
[12, 12, 12]
[13, 13, 13]
[14, 14, 14]
[15, 15, 15]
[16, 16, 16]
[17, 17, 17]
[18, 18, 18]
[19, 19, 19]
[20, 20, 20]

If you need more such files, I can generate as many as you like ;-)
I can send the OOo file as well, if you like to play around with it.
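
For comparison, a rough Python 3 sketch of the same experiment. It assumes the csv module is given text rather than bytes, and it uses a small inline sample in place of the real file, so the DATA string and its three data rows are stand-ins, not the attachment itself.

```python
import csv
import io

# Inline stand-in for the CSV file, so the sketch is self-contained.
DATA = (
    'Numbers in various scripts,,\n'
    ',,\n'
    'Arabic,Thai,Dzongkha\n'
    '1,\u0e51,\u0f21\n'
    '10,\u0e51\u0e50,\u0f21\u0f20\n'
    '20,\u0e52\u0e50,\u0f22\u0f20\n'
)

reader = csv.reader(io.StringIO(DATA))
for _ in range(3):  # skip the three header rows
    next(reader)
# int() accepts any run of gc=Nd digits, whatever the script.
rows = [[int(cell) for cell in row] for row in reader]
print(rows)
```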

I'd say: case closed :-)

-- 
Marc-Andre Lemburg
eGenix.com

Numbers in various scripts,,
,,
Arabic,Thai,Dzongkha
1,๑,༡
2,๒,༢
3,๓,༣
4,๔,༤
5,๕,༥
6,๖,༦
7,๗,༧
8,๘,༨
9,๙,༩
10,๑๐,༡༠
11,๑๑,༡༡
12,๑๒,༡༢
13,๑๓,༡༣
14,๑๔,༡༤
15,๑๕,༡༥
16,๑๖,༡༦
17,๑๗,༡༧
18,๑๘,༡༨
19,๑๙,༡༩
20,๒๐,༢༠


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-03 Thread Antoine Pitrou
On Friday, 3 December 2010 at 13:58 +0900, Stephen J. Turnbull
wrote:
 Antoine Pitrou writes:
 
   The legacy format argument looks like a red herring to me. When
   converting from one format to another it is the programmer's job to
   do his/her job right.
 
 Uhmm, the argument *for* this feature proposed by several people
 is that Python's numeric constructors do it (right) so that the
 programmer doesn't have to.

As far as I understand, Alexander was talking about a legacy pre-unicode
text format. We don't have to support this.

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Neil Hodgson
Stephen J. Turnbull:

 Here's why: '''print "%d" %
 some_integer''' doesn't now, and never will (unless Kristan gets his
 Python 2.8 <wink>), produce Arabic or Han numerals.  Not in any
 language I know of, not in Microsoft Excel, and definitely not in
 Python 2.

   While I don't have Excel to test with, OpenOffice.org Calc will
display in Arabic or Han numerals using the NatNum format codes.
http://www.scintilla.org/ArabicNumbers.png

 Ditto Arabic, I
 would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the
 Arabic digits that have been presented here earlier AFAICT.  Note that
 there's plenty of space for them in that code table (eg, 0xB0-0xB9 is
 empty).  Apparently nobody *ever* thought it was useful to have them!

   DOS code page 864 does use 0xB0-0xB9 for ٠ .. ٩.
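
A quick way to check this from Python, assuming the bundled cp864 codec follows the same mapping as the table linked below:

```python
# Decode bytes 0xB0..0xB9 with Python's bundled cp864 codec.
raw = bytes(range(0xB0, 0xBA))
digits = raw.decode('cp864')
print(digits)
# Each decoded character should convert as a single decimal digit.
print([int(d) for d in digits])
```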
http://www.ascii.ca/cp864.htm

   Neil


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Georg Brandl
On 01.12.2010 23:39, Martin v. Löwis wrote:
 As of today, What’s New In Python 3.2 [1] does not even mention the
 unicodedata upgrade to 6.0.0.
 
 One reason was that I was instructed not to change What's New a few
 years ago.

Maybe all past, present and future whatsnew maintainers can agree on these
rules, which I copied directly from whatsnew/3.2.rst?

   Rules for maintenance:

   * Anyone can add text to this document.  Do not spend very much time
   on the wording of your changes, because your text will probably
   get rewritten to some degree.

   * The maintainer will go through Misc/NEWS periodically and add
   changes; it's therefore more important to add your changes to
   Misc/NEWS than to this file.

   * This is not a complete list of every single change; completeness
   is the purpose of Misc/NEWS.  Some changes I consider too small
   or esoteric to include.  If such a change is added to the text,
   I'll just remove it.  (This is another reason you shouldn't spend
   too much time on writing your addition.)

   * If you want to draw your new text to the attention of the
   maintainer, add 'XXX' to the beginning of the paragraph or
   section.

   * It's OK to just add a fragmentary note about a change.  For
   example: XXX Describe the transmogrify() function added to the
   socket module.  The maintainer will research the change and
   write the necessary text.

   * You can comment out your additions if you like, but it's not
   necessary (especially when a final release is some months away).

   * Credit the author of a patch or bugfix.   Just the name is
   sufficient; the e-mail address isn't necessary.  It's helpful to
   add the issue number:

 XXX Describe the transmogrify() function added to the socket
 module.

 (Contributed by P.Y. Developer; :issue:`12345`.)

   This saves the maintainer the effort of going through the SVN log
   when researching a change.

Georg



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Lennart Regebro
2010/12/2 Stephen J. Turnbull step...@xemacs.org:
 Because that works, but

 print(T1234)

 doesn't (it prints ASCII).  You can't round-trip, but users will
 want/expect that.

You should be able to round-trip, absolutely. I don't think you should
expect print() to do that. str(56) possibly. :)
That's an argument for it to be in a module, as you then would need to
send in a parameter on which decimal characters you want.

 T1000 = float('一.◯◯◯')

That was already discussed here, and it's clear that unicode does not
consider these characters to be something you can use in a decimal
number, and hence it's not broken.
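
A small sketch of what the Unicode database says about these characters, using only the standard unicodedata module. The Arabic-Indic digit is added by me for contrast; the other two are the characters from the float() example.

```python
import unicodedata

# CJK IDEOGRAPH ONE, LARGE CIRCLE, and ARABIC-INDIC DIGIT ONE for contrast.
for ch in ('\u4e00', '\u25ef', '\u0661'):
    print(
        'U+%04X' % ord(ch),
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),  # None: no decimal digit value
        unicodedata.numeric(ch, None),  # None: no numeric value at all
    )
```

Only characters with a decimal value (category Nd) are candidates for positional decimal notation, which is why the Han example is rejected rather than broken.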


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Antoine Pitrou
On Wed, 1 Dec 2010 22:28:49 -0500
Alexander Belopolsky alexander.belopol...@gmail.com wrote:
 
  Both my personal observations when travelling from Turkey to India and
  Wikipedia say yes. When representing a number in Arabic, the lowest-valued
  position is placed on the right, so the order of positions is the same as in
  left-to-right scripts.
  https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals
 
 This matches my limited research on this topic as well.  However, I am
 not sure that when these codes are embedded in Arabic text, their
 logical order always matches their display order.

That shouldn't matter, since unicode text follows logical order. The
display order is up to the graphical representation library.

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 8:36 AM, Antoine Pitrou solip...@pitrou.net wrote:
 On Wed, 1 Dec 2010 22:28:49 -0500
 Alexander Belopolsky alexander.belopol...@gmail.com wrote:
..
 This matches my limited research on this topic as well.  However, I am
 not sure that when these codes are embedded in Arabic text, their
 logical order always matches their display order.

 That shouldn't matter, since unicode text follows logical order. The
 display order is up to the graphical representation library.


I am not so sure.  On my Mac, U+200F (RIGHT-TO-LEFT MARK) affects 0-9
and Arabic-Indic decimals differently:

>>> print('\u200F123')
‏123
>>> print('\u200F\u0661\u0662\u0663')
231

I replaced Arabic-Indic decimals with 0-9 in the output to demonstrate
the point.  Cut-n-paste does not work well in the presence of RTL
directives.

and U+202E (RIGHT-TO-LEFT OVERRIDE) reverts the display order for both:

>>> print('\u202E123')
321
>>> print('\u202E\u0661\u0662\u0663')
321

(again, the output display is simulated not copied.)  I don't know if
explicit RTL directives are ever used in Arabic texts, but it is quite
possible that texts converted from older formats would use them for
efficiency.
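
The different treatment is at least visible in the character database. As a sketch, assuming nothing beyond the standard unicodedata module, the bidirectional classes of the characters involved can be listed like this:

```python
import unicodedata

# Bidirectional class of each character: European digits are 'EN',
# Arabic-Indic digits are 'AN', and the bidi algorithm handles the two
# classes with different rules.
for ch in ('1', '\u0661', '\u200f', '\u202e'):
    print('U+%04X' % ord(ch), unicodedata.bidirectional(ch))

# The logical (storage) order is untouched by the directional controls;
# only the rendering changes.
s = '\u202e123'
print(int(s[1:]))
```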

Note that my point is not to find the correct answer here, but to
demonstrate that we as a group don't have the expertise to get parsing
of Arabic text right.  If we've got it right for Arabic, it is by
chance and not by design.  This still leaves us with 41 other types of
digits for at least 30 different languages.  Nobody will ever assume
that python builtins are suitable for use with all these variants.
This feature is only good for nefarious purposes such as hiding
extra digits in innocent-looking files or smuggling binary data
through naive interfaces.

PS: BTW, shouldn't int('\u0661\u0662\u06DD') be valid? or is it
int('\u06DD\u0661\u0662')?


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Antoine Pitrou
On Thursday, 2 December 2010 at 11:41 -0500, Alexander Belopolsky wrote:
 
 Note that my point is not to find the correct answer here, but to
 demonstrate that we as a group don't have the expertise to get parsing
 of Arabic text right.

I don't understand why you think Arabic or Hebrew text is any different
from Western text. Surely right-to-left isn't more conceptually
complicated than left-to-right, is it?

The fact that mixed rtl + ltr can render bizarrely or is awkward to cut
and paste is quite off-topic for our discussion.

 If we've got it right for Arabic, it is by
 chance and not by design.  This still leaves us with 41 other types of
 digits for at least 30 different languages.

So why do you trust the Unicode standard on other things and not on this
one?

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 11:56 AM, Antoine Pitrou solip...@pitrou.net wrote:
 On Thursday, 2 December 2010 at 11:41 -0500, Alexander Belopolsky wrote:

 Note that my point is not to find the correct answer here, but to
 demonstrate that we as a group don't have the expertise to get parsing
 of Arabic text right.

 I don't understand why you think Arabic or Hebrew text is any different
 from Western text. Surely right-to-left isn't more conceptually
 complicated than left-to-right, is it?


No, but a mix of LTR and RTL is certainly more difficult than either
of the two.  I invite you to digest Unicode Standard Annex #9 before
we continue this discussion.

See http://unicode.org/reports/tr9/.


 The fact that mixed rtl + ltr can render bizarrely or is awkward to cut
 and paste is quite off-topic for our discussion.


No, it is not.  One of the invented use cases in this thread was naive
users' desire to enter numbers using their preferred local decimals.
The same users may want to be able to cut and paste their decimals as
well.  More importantly, however, legacy formats may not have support
for mixed-direction text and may require that "John is 41" be stored
as "41 si nhoJ", and a Unicode converter would turn it into "[RTL]John is
14", which will still display as "41 si nhoJ", but int(s[-2:]) will
return 14, not 41.

 If we've got it right for Arabic, it is by
 chance and not by design.  This still leaves us with 41 other types of
 digits for at least 30 different languages.

 So why do you trust the Unicode standard on other things and not on this
 one?

What other things? As far as I understand the only str method that was
designed to comply with Unicode recommendations was str.isidentifier().
 And we have some really bizarre results:


>>> '\u2164'.isidentifier()
True
>>> '\u2164'.isalpha()
False

and can you describe the difference between str.isdigit() and
str.isdecimal()?  According to the reference manual,


str.isdecimal()
Return true if all characters in the string are decimal characters and
there is at least one character, false otherwise. Decimal characters
include digit characters, and all characters that can be used to
form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.

str.isdigit()
Return true if all characters in the string are digits and there is at
least one character, false otherwise.
 http://docs.python.org/dev/library/stdtypes.html#str.isdecimal

Since U+0660 is mentioned in the first definition and not in the
second, I may conclude that it is not a digit, but

>>> '\u0660'.isdigit()
True

If you know the correct answer, please contribute it here:
http://bugs.python.org/issue10587.
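
For anyone who wants to experiment, here is a minimal sketch that tabulates the three predicates for a few characters. The sample set and the helper name classify are my own; as far as I can tell the predicates nest, with isdecimal() the narrowest and isnumeric() the widest.

```python
SAMPLES = {
    'ARABIC-INDIC DIGIT ZERO': '\u0660',
    'SUPERSCRIPT TWO': '\u00b2',
    'VULGAR FRACTION ONE HALF': '\u00bd',
    'ROMAN NUMERAL FIVE': '\u2164',
}


def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return ch.isdecimal(), ch.isdigit(), ch.isnumeric()


for name, ch in SAMPLES.items():
    print(name, classify(ch))
```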


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Antoine Pitrou
On Thursday, 2 December 2010 at 13:14 -0500, Alexander Belopolsky wrote:
  I don't understand why you think Arabic or Hebrew text is any different
  from Western text. Surely right-to-left isn't more conceptually
  complicated than left-to-right, is it?
 
 
 No, but a mix of LTR and RTL is certainly more difficult than either
 of the two.  I invite you to digest Unicode Standard Annex #9 before
 we continue this discussion.
 
 See http://unicode.org/reports/tr9/.

“This annex describes specifications for the *positioning* of characters
flowing from right to left” (emphasis mine)

Looks like something for implementors of rendering engines, which
python-dev is not AFAICT.

 The same users may want to be able to cut and paste their decimals as
 well.  More importantly, however, legacy formats may not have support
 for mixed-direction text and may require that "John is 41" be stored
 as "41 si nhoJ", and a Unicode converter would turn it into "[RTL]John is
 14", which will still display as "41 si nhoJ", but int(s[-2:]) will
 return 14, not 41.

The legacy format argument looks like a red herring to me. When
converting from one format to another it is the programmer's job to
do his/her job right.

  If we've got it right for Arabic, it is by
  chance and not by design.  This still leaves us with 41 other types of
  digits for at least 30 different languages.
 
  So why do you trust the Unicode standard on other things and not on this
  one?
 
 What other things?

Everything which the Unicode database stores and that we already rely
on.

 As far as I understand the only str method that was
 designed to comply with Unicode recommendations was str.isidentifier().

I don't think so.  str.split() and str.splitlines() are also defined in
conformance to the SPEC, AFAIK.  They certainly try to.
And, outside of str itself, the re module tries to follow Unicode
categories as well (for example, \d should match non-ASCII digits).

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
On 02.12.2010 03:01, Ben Finney wrote:
 Stephen J. Turnbull step...@xemacs.org writes:
 
 Furthermore, he provided good *objective* reason (excessive cost, to
 which I can also testify, in several different input methods for
 Japanese) why numbers simply would not be input that way.

 What's left is copy/paste via the mouse.
 
 For direct entry by an interactive user, yes. Why are some people in
 this discussion thinking only of direct entry by an interactive user?

Ultimately, somebody will have entered the data.

 Input from an existing text file, as I said earlier.

Which *specific* existing text file? Have you actually *seen* such a
text file?

 Direct entry at the console is a red herring.

And we don't need powerhouses because power comes out of the socket.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
 Maybe all past, present and future whatsnew maintainers can agree on these
 rules, which I copied directly from whatsnew/3.2.rst?

I don't think all past maintainers can (I'm pretty certain that AMK
would disagree), but if that's the current policy, I can certainly try
following it (I didn't know it exists because I never look at the file).

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Now, one may wonder what precisely a possibly signed floating point
 number is, but most likely, this refers to

 floatnumber   ::=  pointfloat | exponentfloat
 pointfloat    ::=  [intpart] fraction | intpart "."
 exponentfloat ::=  (intpart | pointfloat) exponent
 intpart       ::=  digit+
 fraction      ::=  "." digit+
 exponent      ::=  ("e" | "E") ["+" | "-"] digit+
 digit         ::=  "0"..."9"

 I don't see why the language spec should limit the wealth of number
 formats supported by float().
 
 If it doesn't, there should be some other specification of what
 is correct and what is not. It must not be unspecified.

True.

 It is not uncommon for Asians and other non-Latin script users to
 use their own native script symbols for numbers. Just because these
 digits may look strange to someone doesn't mean that they are
 meaningless or should be discarded.
 
 Then these users should speak up and indicate their need, or somebody
 should speak up and confirm that there are users who actually want
 '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
 system in which '١٢٣٤.٥٦e4' means 12345600.0.

I'm not sure what you're after here.

 Please also remember that Python3 now allows Unicode names for
 identifiers for much the same reasons.
 
 No no no. Addition of Unicode identifiers has a well-designed,
 deliberate specification, with a PEP and all. The support for
 non-ASCII digits in float appears to be ad-hoc, and not founded
 on actual needs of actual users.

Please note that we didn't have PEPs and the PEP process at the
time. The Unicode proposal predates and in some respects inspired
the PEP process.

The decision to add this support was deliberate based on the desire
to support as many of the nice features of Unicode in Python as
we could. At least that was what was driving me at the time.

Regarding actual needs of actual users: I don't buy that as an
argument when it comes to supporting a standard that is meant
to attract users with non-ASCII origins.

Some references you may want to read up on:

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
http://en.wikipedia.org/wiki/Vietnamese_numerals
http://en.wikipedia.org/wiki/Korean_numerals
http://en.wikipedia.org/wiki/Japanese_numerals

Even MS Office supports them:

http://languages.siuc.edu/Chinese/Language_Settings.html

 Note that the support in float() (and the other numeric constructors)
 to work with Unicode code points was explicitly added when Unicode
 support was added to Python and has been available since Python 1.6.
 
 That doesn't necessarily make it useful. Alexander's complaint is that
 it makes Python unstable (i.e. changing as the UCD changes).

If that were true, then all Unicode database (UCD) changes would make
Python unstable. However, most changes to existing code points in the UCS
are bug fixes, so they actually have a stabilizing quality more than
a destabilizing one.

 It is not a bug by any definition of bug
 
 Most certainly it is: the documentation is either underspecified,
 or deviates from the implementation (when taking the most plausible
 interpretation). This is the very definition of bug.

The implementation is not a bug and neither was this a bug in the
2.x series of the Python documentation. The Python 3.x docs apparently
introduced a reference to the language spec which is clearly not
capturing the wealth of possible inputs.

So, yes, we're talking about a documentation bug, but not an
implementation bug.
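
To make the gap between the quoted grammar and the implementation concrete,
here is a rough sketch. The regular expression is my own transcription of the
floatnumber grammar (an assumption, not something taken from CPython), and the
helper names are hypothetical:

```python
import re

# Transcription of the quoted grammar:
#   pointfloat [exponent]  |  intpart exponent
FLOATNUMBER = re.compile(
    r'(?:[0-9]*\.[0-9]+|[0-9]+\.)(?:[eE][+-]?[0-9]+)?'
    r'|[0-9]+[eE][+-]?[0-9]+'
)


def in_grammar(text):
    """True if text is a floatnumber according to the grammar above."""
    return FLOATNUMBER.fullmatch(text) is not None


def accepted_by_float(text):
    """True if the float() constructor accepts text."""
    try:
        float(text)
    except ValueError:
        return False
    return True


samples = [
    '1234.56',
    '1e10',
    '-1.5',                                   # sign is not part of the grammar
    ' 1.5 ',                                  # neither is whitespace
    'inf',                                    # nor inf/nan
    '\u0661\u0662\u0663\u0664.\u0665\u0666',  # Arabic-Indic digits
]
for s in samples:
    print(ascii(s), in_grammar(s), accepted_by_float(s))
```

Every sample is accepted by float(), but only the first two match the grammar.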

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 29 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Georg Brandl
Am 02.12.2010 20:40, schrieb Martin v. Löwis:
 Maybe all past, present and future whatsnew maintainers can agree on these
 rules, which I copied directly from whatsnew/3.2.rst?
 
 I don't think all past maintainers can

Yes, and the same goes for the future ones, since they may not even know yet
that they will be whatsnew maintainers.  Or maybe they aren't born yet (let's
hope for a long life of Python 3...).

 (I'm pretty certain that AMK
 would disagree), but if that's the current policy, I can certainly try
 following it (I didn't know it exists because I never look at the file).

The large chunk of rules appeared in 2.6, where AMK still was maintainer.
But even in the whatsnew for 2.4, there is this:

.. Don't write extensive text for new sections; I'll do that.
.. Feel free to add commented-out reminders of things that need
.. to be covered.  --amk

But in any case, they are certainly valid for the current whatsnew -- even
if Raymond likes to grumble about too expansive commits :)

Georg



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
 Then these users should speak up and indicate their need, or somebody
 should speak up and confirm that there are users who actually want
 '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
 system in which '١٢٣٤.٥٦e4' means 12345600.0.
 
 I'm not sure what you're after here.

That the current float() constructor accepts tons of bogus character
strings and accepts them as numbers, and that it should stop doing so.
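
For reference, a short sketch of what the constructor does with the two
strings quoted above, assuming a Python 3 interpreter. Both use Arabic-Indic
digits together with an ASCII full stop:

```python
# '١٢٣٤.٥٦' spelled with escapes: Arabic-Indic digits, ASCII '.'
arabic_indic = '\u0661\u0662\u0663\u0664.\u0665\u0666'

plain = float(arabic_indic)
with_exponent = float(arabic_indic + 'e4')  # ASCII 'e' exponent appended

print(plain, with_exponent)
```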

 The decision to add this support was deliberate based on the desire
 to support as much of the nice features of Unicode in Python as
 we could. At least that was what was driving me at the time.

At the time, this may have been the right thing to do. With the
experience gained, we should now conclude to revert this particular aspect.

 Some references you may want to read up on:
 
 http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
 http://en.wikipedia.org/wiki/Vietnamese_numerals
 http://en.wikipedia.org/wiki/Korean_numerals
 http://en.wikipedia.org/wiki/Japanese_numerals

I don't question that people use non-ASCII characters to
denote numbers. I claim that the specific support in Python for that
has no connection to reality. I further claim that the use of non-ASCII
numbers is a local convention, and that if you provide a library to
parse numbers, users (of that library) will somehow have to specify
which notational convention(s) is reasonable for the input they have.

 Even MS Office supports them:
 
 http://languages.siuc.edu/Chinese/Language_Settings.html

That's printing, though, not parsing.

Notice that Python does *not* currently support printing numbers in
other scripts - even though this may actually be more useful than
parsing.
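
A minimal sketch of what such printing could look like, assuming the target
script's ten digits are contiguous code points (true for the Nd blocks I know
of). The function name is hypothetical, and the decimal separator is
deliberately left untouched:

```python
def to_script_digits(number_text, zero):
    """Map ASCII digits in number_text to the digit block starting at zero.

    Hypothetical helper: only digits are translated; separators, signs and
    exponent markers are passed through unchanged.
    """
    table = {ord('0') + i: ord(zero) + i for i in range(10)}
    return number_text.translate(table)


arabic = to_script_digits(format(1234.56, '.2f'), '\u0660')  # ARABIC-INDIC DIGIT ZERO
thai = to_script_digits('2010', '\u0e50')                    # THAI DIGIT ZERO
print(arabic, thai)
```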

 Note that the support in float() (and the other numeric constructors)
 to work with Unicode code points was explicitly added when Unicode
 support was added to Python and has been available since Python 1.6.

 That doesn't necessarily make it useful. Alexander's complaint is that
 it makes Python unstable (i.e. changing as the UCD changes).
 
 If that were true, then all Unicode database (UCD) changes would make
 Python unstable.

That's indeed the case - they do (see the recent bug report on white
space processing). However, any change makes Python unstable (in the
sense that it can potentially break existing applications), and, in
many cases, the risk of breaking something is well worth it.

In the case of number parsing, I think Python would be better if
float() rejected non-ASCII strings, and any support for such parsing
should be redone correctly in a different place (preferably along with
printing of numbers).

 Most certainly it is: the documentation is either underspecified,
 or deviates from the implementation (when taking the most plausible
 interpretation). This is the very definition of bug.
 
 The implementation is not a bug and neither was this a bug in the
 2.x series of the Python documentation.

Of course the 2.x documentation is wrong, in that it is severely
underspecified, and the most straight-forward interpretation of the
specific wording gives an incorrect impression of the implementation.

 The Python 3.x docs apparently
 introduced a reference to the language spec which is clearly not
 capturing the wealth of possible inputs.

Right - but only because the 2.x documentation *already* suggested that
the supported syntax matches the literal syntax - as that's the most
natural thing to assume.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 [...]
 For direct entry by an interactive user, yes. Why are some people in
 this discussion thinking only of direct entry by an interactive user?
 
 Ultimately, somebody will have entered the data.

I don't think you really believe that all data processed by a
computer was eventually manually entered by someone :-)

I already gave you a couple of examples of how such data can
end up being input for Python number constructors. If you are
still curious, please see the Wikipedia pages I linked to,
or have a look at these keyboards:

http://en.wikipedia.org/wiki/File:KB_Arabic_MAC.svg
http://en.wikipedia.org/wiki/File:Keyboard_Layout_Sanskrit.png
http://en.wikipedia.org/wiki/File:800px-KB_Thai_Kedmanee.png
http://en.wikipedia.org/wiki/File:Tibetan_Keyboard.png
http://en.wikipedia.org/wiki/File:KBD-DZ-noshift-2009.png

(all referenced on http://en.wikipedia.org/wiki/Keyboard_layout)

and then compare these to:

http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt

Arabic numerals are being used a lot nowadays in Asian countries,
but that doesn't mean that the native script versions are not
being used anymore.

Furthermore, data can well originate from texts that were written
hundreds or even thousands of years ago, so there is plenty of
material available for processing.

Even if not entered directly, there are plenty of ways to convert
Arabic numerals (or other numeral systems) to the above forms,
e.g. in MS Office for Thai:

http://office.microsoft.com/en-us/excel-help/convert-arabic-numbers-to-thai-text-format-HP003074364.aspx

Anyway, as mentioned before: all this is really beside the point:

If we want to support Unicode in Python, we have to also support
conversion of numerals declared in Unicode into a form that can
be processed by Python. Regardless of where such data originates.

If we were not to follow this approach, we could just as well
decide not to support reading Egyptian Hieroglyphs based
on the argument that there's no keyboard to enter them...

http://www.unicode.org/charts/PDF/U13000.pdf  :-)

(from http://www.unicode.org/charts/)

 Input from an existing text file, as I said earlier.
 
 Which *specific* existing text file? Have you actually *seen* such a
 text file?

Have you tried Google ?

http://www.google.com/search?q=١٢٣
http://www.google.com/search?q=٣+site%3Agov.lb

Some examples:

http://www.bdl.gov.lb/circ/intpdf/int123.pdf
http://www.cdr.gov.lb/study/sdatl/Arabic/Chapter3.PDF
http://www.batroun.gov.lb/PDF/Waredat2006.pdf

(these all use http://en.wikipedia.org/wiki/Eastern_Arabic_numerals)

 Direct entry at the console is a red herring.
 
 And we don't need powerhouses because power comes out of the socket.

Martin, the argument simply doesn't fit well with the discussion
about Python and Unicode.

We introduced Unicode in Python not because there was a need
for each and every code point in Unicode, but because we wanted
to adopt a standard which doesn't prefer any one way of writing
things over another.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
 Arabic numerals are being used a lot nowadays in Asian countries,
 but that doesn't mean that the native script versions are not
 being used anymore.

I never claimed that people are not using their local scripts to enter
numbers. However, none of your examples is about Chinese numerals using
an ASCII full stop as a decimal point. The only thing I claimed about
usage (actually only repeating haiyang kang's earlier claim) is that
nobody would enter Chinese numerals with a keyboard and then use full
stop as the decimal separator.

So all your counter-examples just don't apply - I don't deny them.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Steven D'Aprano

Martin v. Löwis wrote:

Then these users should speak up and indicate their need, or somebody
should speak up and confirm that there are users who actually want
'١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
system in which '١٢٣٤.٥٦e4' means 12345600.0.

I'm not sure what you're after here.


That the current float() constructor accepts tons of bogus character
strings and accepts them as numbers, and that it should stop doing so.


What bogus characters do the float() and int() constructors accept? As 
far as I can see, they only accept numerals.



[...]

Notice that Python does *not* currently support printing numbers in
other scripts - even though this may actually be more useful than
parsing.


Lack of one function, even if more useful, does not imply that an 
existing function should be removed.


[...]

In the case of number parsing, I think Python would be better if
float() rejected non-ASCII strings, and any support for such parsing
should be redone correctly in a different place (preferably along with
printing of numbers).


So your problems with the current behaviour are:

(1) in some unspecified way, it's not done correctly;

(2) it belongs somewhere other than float() and int().

That second is awfully close to bike-shedding. Since you accept that 
Python *should* have the current behaviour, and Python *already* has the 
current behaviour, it seems strange that you are kicking up such a fuss 
merely to *move* the implementation of that behaviour out of the numeric 
constructors into some unspecified different place.


I think it would be constructive to explain:

- how the current behaviour is incorrect;
- your suggestions for correcting it; and
- a concrete suggestion for where you would like to see the behaviour 
moved to, and why that would be better than where it currently is.




--
Steven



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou solip...@pitrou.net wrote:
..
 I don't think so.  str.split() and str.splitlines() are also defined in
 conformance to the SPEC, AFAIK.  They certainly try to.

You are joking, right?  Where exactly does Unicode specify something like this:

>>> ''.join('𐌀𐌁𐌂'.split('\udf00\ud800'))
'𐌁𐌂'
?

OK, splitting on a given separator has very little to do with Unicode
or UCD, but str.splitlines()  makes absolutely no attempt to conform
to Unicode Standard Annex #14 (Unicode line breaking algorithm).
Wait, UAX #14 is actually relevant to textwrap module which saw very
little change since 2.x days.  So, what exactly does str.splitlines()
do?   And which part of the Unicode standard defines how it is
different from str.split(.., '\n')?  Reference manual does not help me
here either:


str.splitlines([keepends])

Return a list of the lines in the string, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends is
given and true.
 http://docs.python.org/dev/library/stdtypes.html#str.splitlines
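
For what it is worth, a small sketch of how the two calls differ in practice,
assuming Python 3. The sample string is made up for illustration:

```python
# LF, CRLF and U+2028 LINE SEPARATOR between four short "lines"
text = 'a\nb\r\nc\u2028d'

by_splitlines = text.splitlines()
by_newline = text.split('\n')

print(by_splitlines)
print([ascii(part) for part in by_newline])
```

splitlines() treats CRLF and U+2028 as line boundaries, while split('\n')
only knows about the single separator it was given.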


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Antoine Pitrou
Le jeudi 02 décembre 2010 à 16:34 -0500, Alexander Belopolsky a écrit :
 On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou solip...@pitrou.net wrote:
 ..
  I don't think so.  str.split() and str.splitlines() are also defined in
  conformance to the SPEC, AFAIK.  They certainly try to.
 
 You are joking, right?

Perhaps you could look at the implementation.





Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
Am 02.12.2010 22:30, schrieb Steven D'Aprano:
 Martin v. Löwis wrote:
 Then these users should speak up and indicate their need, or somebody
 should speak up and confirm that there are users who actually want
 '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
 system in which '١٢٣٤.٥٦e4' means 12345600.0.
 I'm not sure what you're after here.

 That the current float() constructor accepts tons of bogus character
 strings and accepts them as numbers, and that it should stop doing so.
 
 What bogus characters do the float() and int() constructors accept? As
 far as I can see, they only accept numerals.

Not bogus characters, but bogus character strings. E.g. strings that mix
digits from different scripts, and mix them with the Python decimal
separator.
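
An illustration of such a mixed string, as a sketch under the assumption of a
Python 3 interpreter. The helper digit_scripts() is hypothetical and simply
derives a label from the Unicode character name:

```python
import unicodedata


def digit_scripts(text):
    """Return the name prefixes of all decimal digits (category Nd) in text."""
    scripts = set()
    for ch in text:
        if unicodedata.category(ch) == 'Nd':
            # 'ARABIC-INDIC DIGIT TWO' -> 'ARABIC-INDIC'; 'DIGIT ONE' -> 'ASCII'
            prefix = unicodedata.name(ch).partition('DIGIT')[0].strip()
            scripts.add(prefix or 'ASCII')
    return scripts


# ASCII 1, Arabic-Indic 2, Thai 3, ASCII full stop, fullwidth 4
mixed = '1\u0662\u0e53.\uff14'

value = float(mixed)
scripts = digit_scripts(mixed)
print(value, sorted(scripts))
```

A stricter parser could reject any input for which digit_scripts() returns
more than one entry.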

 Notice that Python does *not* currently support printing numbers in
 other scripts - even though this may actually be more useful than
 parsing.
 
 Lack of one function, even if more useful, does not imply that an
 existing function should be removed.

No. But if the specific function(ality) is not useful and
underspecified, it should be removed.

 So your problems with the current behaviour are:
 
 (1) in some unspecified way, it's not done correctly;

No. My main concern is that it is not properly specified. If it was
specified, I could then tell you what precisely is wrong about it.
Right now, I can only give examples for input that it should not accept,
and examples of input that it should, but does not accept.

 (2) it belongs somewhere other than float() and int().

That's only because it also needs a parameter to specify what syntax to
follow, somehow. That parameter could be explicit or implicit, and it
could be to float or to some other function. But it must be available,
and is not.
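
A rough sketch of what an explicit parameter could look like. Everything here
(function name, parameters, defaults) is hypothetical and only meant to show
the shape of the idea, not a proposed API:

```python
def parse_decimal(text, zero='0', decimal_sep='.'):
    """Parse text written with one digit block and one decimal separator.

    zero is the code point of the digit zero in the expected script,
    decimal_sep the expected decimal separator. Anything else is rejected.
    """
    digits = {chr(ord(zero) + i): str(i) for i in range(10)}
    out = []
    for ch in text.strip():
        if ch in digits:
            out.append(digits[ch])
        elif ch == decimal_sep:
            out.append('.')
        elif ch in '+-':
            out.append(ch)
        else:
            raise ValueError('unexpected character %r' % ch)
    return float(''.join(out))


# Arabic-Indic digits with U+066B ARABIC DECIMAL SEPARATOR
value = parse_decimal('\u0661\u0662\u0663\u0664\u066b\u0665\u0666',
                      zero='\u0660', decimal_sep='\u066b')
print(value)
```

With this convention selected, the same digits followed by an ASCII full stop
are rejected instead of being guessed at.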

 That second is awfully close to bike-shedding. Since you accept that
 Python *should* have the current behaviour

No, I don't. I think it behaves incorrectly, accepting garbage input and
guessing some meaning out of it.

 - how the current behaviour is incorrect;

See above: it accepts strings that do not denote real numbers in any
writing system, and, despite the claim that the feature is there to
support other writing systems, actually does not truly support other
writing systems.

 - your suggestions for correcting it; and

Make the current implementation exactly match the current documentation.
I think the documentation is correct; the implementation is wrong.

 - a concrete suggestion for where you would like to see the behaviour
 moved to, and why that would be better than where it currently is.

The current behavior should go nowhere; it is not useful. Something very
similar to the current behavior (but done correctly) should go into the
locale module.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote:
..
 Have you tried Google ?


I tried Google and I could not find any plain text or HTML file that
would use Arabic-Indic numerals.  What was interesting, though, was that a
search for quran unicode (without quotes) brought me to
http://www.sacred-texts.com which says that they've been using unicode
since 2002 in their archives.  Interestingly enough, their version of
Qur'an uses ordinary digits for ayah numbers.  See, for example
http://www.sacred-texts.com/isl/uq/050.htm.

I will change my mind on this issue when you present a
machine-readable file with Arabic-Indic numerals and a program capable
of reading it and show that this program uses the same number parsing
algorithm as Python's int() or float().


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Mark Dickinson
On Thu, Dec 2, 2010 at 8:23 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 In the case of number parsing, I think Python would be better if
 float() rejected non-ASCII strings, and any support for such parsing
 should be redone correctly in a different place (preferably along with
 printing of numbers).

+1.  The set of strings currently accepted by the float constructor
just seems too ad hoc to be at all useful.  Apart from the decimal
separator issue, and the question of exactly which decimal digits are
accepted and which aren't, there are issues like this one:

>>> x = '\uff11\uff25\uff0b\uff11\uff10'
>>> x
'１Ｅ＋１０'
>>> float(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uff25' in
position 1: invalid decimal Unicode string
>>> y = '\uff11E+\uff11\uff10'
>>> y
'１E+１０'
>>> float(y)
10000000000.0

That is, fullwidth *digits* are allowed, but none of the other
characters can be fullwidth variants.  Unfortunately, a float string
doesn't consist solely of digits, and it seems to me to make little
sense to allow variation in the digits without allowing corresponding
variations in the other characters that might appear ('.', 'e', 'E',
'+', '-').
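
One way to see the asymmetry, sketched under the assumption of a recent
Python 3: NFKC normalization folds all of the fullwidth characters to their
ASCII counterparts, after which float() accepts the string. This is only an
illustration; NFKC also folds many unrelated characters, so it is not
something I am proposing for the constructor itself.

```python
import unicodedata

x = '\uff11\uff25\uff0b\uff11\uff10'  # fullwidth '1', 'E', '+', '1', '0'

try:
    float(x)
    accepted_raw = True
except ValueError:  # UnicodeEncodeError is a subclass of ValueError
    accepted_raw = False

folded = unicodedata.normalize('NFKC', x)
value = float(folded)

print(accepted_raw, ascii(folded), value)
```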

A couple of slightly trickier decisions: (1) the float constructor
currently does accept leading and trailing whitespace;  should it
allow any Unicode whitespace characters here? I'd say yes. (2) For
int() rather than float(), there's a bit more value in allowing the
variant digits, since it provides an easy way to interpret those
digits.  The decimal module currently makes use of this, for example
(the decimal spec requires that non-European digits be accepted).  I'd
be happier if this functionality were moved elsewhere, though.  The
int constructor is, if anything, currently worse off than float,
thanks to its attempts to support non-decimal bases.
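
A short sketch of the behaviour described in (2), assuming a current
CPython 3. The digits are translated to ASCII before the base is applied,
which is why the last call succeeds even though nobody writes hexadecimal
that way:

```python
from decimal import Decimal

devanagari = '\u0967\u0968\u0969'  # DEVANAGARI DIGIT ONE, TWO, THREE

as_int = int(devanagari)
as_decimal = Decimal(devanagari)
as_hex = int(devanagari, 16)  # read as if it were '123' in base 16

print(as_int, as_decimal, as_hex)
```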

There's value in having an easy-to-specify, easy-to-maintain API for
these basic builtin functions.  For one thing, it helps non-CPython
implementations.

[MAL]
 The Python 3.x docs apparently
 introduced a reference to the language spec which is clearly not
 capturing the wealth of possible inputs.

That documentation update was my fault;  I was motivated to make the
update by issues unrelated to this one (mostly to do with Python 3's
more consistent handling of inf and nan, as a result of all the new
float-string conversion code).  If I'd been thinking harder, I would
have remembered that float accepted the non-European digits and added
a note to that effect.  This (unintentional) omission does underline
the point that it's difficult right now to document and understand
exactly what the float constructor does or doesn't accept.

Mark


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Eric Smith

On 12/2/2010 4:48 PM, Martin v. Löwis wrote:

Am 02.12.2010 22:30, schrieb Steven D'Aprano:

Martin v. Löwis wrote:

Then these users should speak up and indicate their need, or somebody
should speak up and confirm that there are users who actually want
'١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
system in which '١٢٣٤.٥٦e4' means 12345600.0.

I'm not sure what you're after here.


That the current float() constructor accepts tons of bogus character
strings and accepts them as numbers, and that it should stop doing so.


What bogus characters do the float() and int() constructors accept? As
far as I can see, they only accept numerals.


Not bogus characters, but bogus character strings. E.g. strings that mix
digits from different scripts, and mix them with the Python decimal
separator.


Notice that Python does *not* currently support printing numbers in
other scripts - even though this may actually be more useful than
parsing.


Lack of one function, even if more useful, does not imply that an
existing function should be removed.


No. But if the specific function(ality) is not useful and
underspecified, it should be removed.


So your problems with the current behaviour are:

(1) in some unspecified way, it's not done correctly;


No. My main concern is that it is not properly specified. If it was
specified, I could then tell you what precisely is wrong about it.
Right now, I can only give examples for input that it should not accept,
and examples of input that it should, but does not accept.


(2) it belongs somewhere other than float() and int().


That's only because it also needs a parameter to specify what syntax to
follow, somehow. That parameter could be explicit or implicit, and it
could be to float or to some other function. But it must be available,
and is not.


That second is awfully close to bike-shedding. Since you accept that
Python *should* have the current behaviour


No, I don't. I think it behaves incorrectly, accepting garbage input and
guessing some meaning out of it.


- how the current behaviour is incorrect;


See above: it accepts strings that do not denote real numbers in any
writing system, and, despite the claim that the feature is there to
support other writing systems, actually does not truly support other
writing systems.


- your suggestions for correcting it; and


Make the current implementation exactly match the current documentation.
I think the documentation is correct; the implementation is wrong.


- a concrete suggestion for where you would like to see the behaviour
moved to, and why that would be better than where it currently is.


The current behavior should go nowhere; it is not useful. Something very
similar to the current behavior (but done correctly) should go into the
locale module.


I agree with everything Martin says here. I think the basic premise is: 
you won't find strings in the wild that use non-ASCII digits but do 
use the ASCII dot as a decimal point. And that's what float() is looking 
for. (And that doesn't even begin to address what it expects for an 
exponent 'e'.)


Eric.


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Eric Smith wrote:
 The current behavior should go nowhere; it is not useful. Something very
 similar to the current behavior (but done correctly) should go into the
 locale module.
 
 I agree with everything Martin says here. I think the basic premise is:
 you won't find strings in the wild that use non-ASCII digits but do
 use the ASCII dot as a decimal point. And that's what float() is looking
 for. (And that doesn't even begin to address what it expects for an
 exponent 'e'.)

http://en.wikipedia.org/wiki/Decimal_mark

In China, comma and space are used to mark digit groups because dot is used as 
decimal mark.

Note that float() can also parse integers, it just returns them as
floats :-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 Have you tried Google ?

 
 I tried Google and I could not find any plain text or HTML file that
 would use Arabic-Indic numerals.  What was interesting, though, was that a
 search for quran unicode (without quotes) brought me to
 http://www.sacred-texts.com which says that they've been using unicode
 since 2002 in their archives.  Interestingly enough, their version of
 Qur'an uses ordinary digits for ayah numbers.  See, for example
 http://www.sacred-texts.com/isl/uq/050.htm.
 
 I will change my mind on this issue when you present a
 machine-readable file with Arabic-Indic numerals and a program capable
 of reading it and show that this program uses the same number parsing
 algorithm as Python's int() or float().

Have you had a look at the examples I posted ? They include texts
and tables with numbers written using east asian arabic numerals.

Here's an example of a famous Chinese text using Chinese numerals:

http://ctext.org/nine-chapters

Unfortunately, the Chinese numerals are not listed in category Nd,
so Python won't be able to parse them. There seem to be various reasons
for this, one of them being that the numeral code points were not
defined as a range of code points.
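
A minimal sketch of that difference, assuming Python 3 and the standard
unicodedata module. The two sample characters and the try_int helper are
illustrative choices, not taken from the linked text:

```python
import unicodedata

arabic_indic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE
han_three = '\u4e09'            # CJK ideograph meaning "three"

# Both characters carry a numeric value in the Unicode database,
# but only the Arabic-Indic one is a decimal digit (category Nd).
print(unicodedata.numeric(arabic_indic_three), unicodedata.category(arabic_indic_three))
print(unicodedata.numeric(han_three), unicodedata.category(han_three))


def try_int(text):
    """Return int(text), or None if int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


print(try_int(arabic_indic_three * 3))  # 333
print(try_int(han_three * 3))           # None
```

So the numeric value is available for the ideograph, but int() only
looks at characters that are decimal digits.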

I'm sure you can find other books on mathematics in Sanskrit or
Arabic scripts as well.

But this whole branch of the discussion is not going to go anywhere.

The point is that we support all of Unicode in Python, not just a fragment,
and therefore the numeric constructors support all of Unicode.

Using them, it's very easy to support numbers in all kinds of variants,
whether bound to a locale or not.

Adding more locale aware numeric parsers and formatters to the
locale module, based on these APIs is certainly a good idea,
but orthogonal to the ongoing discussion, IMO.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
 Nick Coghlan wrote:
 On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburgm...@egenix.com  wrote:
 If we would go down that road, we would also have to disable other
 Unicode features based on locale, e.g. whether to apply non-ASCII
 case mappings, what to consider whitespace, etc.

 We don't do that for a good reason: Unicode is supposed to be
 universal and not limited to a single locale.

 Because parsing numbers is about more than just the characters used
 for the individual digits. There are additional semantics associated
 with digit ordering (for any number) and decimal separators and
 exponential notation (for floating point numbers) and those vary by
 locale. We deliberately chose to make the builtin numeric parsers
 unaware of all of those things, and assuming that we can simply parse
 other digits as if they were their ASCII equivalents and otherwise
 assume a C locale seems questionable.

 Sure, and those additional semantics are locale dependent, even
 between ASCII-only locales. However, that does not apply to the
 basic building blocks, the decimal digits themselves.

 If the existing semantics can be adequately defined, documented and
 defended, then retaining them would be fine. However, the language
 reference needs to define the behaviour properly so that other
 implementations know what they need to support and what can be chalked
 up as being just an implementation accident of CPython. (As a point in
 the plus column, both decimal.Decimal and fractions.Fraction were able
 to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
 and float handling)

 The support is built into the C API, so there's not really much
 surprise there.

 Regarding documentation, we'd just have to add that numbers may
 be made up of Unicode code points in the category Nd.

 See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
 4.6 for details

 
 Decimal digits form a large subcategory of numbers consisting of those
 digits that can be used to form decimal-radix numbers. They include
 script-specific digits, but exclude characters such as Roman numerals
 and Greek acrophonic numerals. (Note that <1, 5> = 15 = fifteen, but
 <I, V> = IV = four.) Decimal digits also exclude the compatibility
 subscript or superscript digits to prevent simplistic parsers from
 misinterpreting their values in context.
 

 int(), float() and long() (in Python2) are such simplistic
 parsers.
 
 Since you are the knowledgeable advocate of the current behavior, perhaps
 you could open an issue and propose a doc patch, even if not .rst
 formatted.

Good suggestion. I tried to collect as much context as possible:

http://bugs.python.org/issue10610

I'll leave the rst-magic to someone else, but will certainly help
if you have more questions about the details.
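
For reference, a short sketch of the behaviour being documented,
assuming Python 3. The accepts helper is a hypothetical name introduced
only for this illustration:

```python
import unicodedata
from decimal import Decimal

# The '١٢٣٤.٥٦' example from above, spelled with escapes to avoid
# any ambiguity about character order.
arabic_indic = '\u0661\u0662\u0663\u0664.\u0665\u0666'

as_float = float(arabic_indic)
as_decimal = Decimal(arabic_indic)
print(as_float, as_decimal)  # 1234.56 1234.56


def accepts(constructor, text):
    """True if constructor(text) succeeds, False on ValueError."""
    try:
        constructor(text)
    except ValueError:
        return False
    return True


# SUPERSCRIPT TWO has a digit value but is category No, not Nd,
# so the "simplistic parsers" int() and float() reject it.
superscript_two = '\u00b2'
print(unicodedata.category(superscript_two))  # No
print(unicodedata.digit(superscript_two))     # 2
print(accepts(int, superscript_two))          # False
```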

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg m...@egenix.com wrote:
..
 I will change my mind on this issue when you present a
 machine-readable file with Arabic-Indic numerals and a program capable
 of reading it and show that this program uses the same number parsing
 algorithm as Python's int() or float().

 Have you had a look at the examples I posted ? They include texts
 and tables with numbers written using east asian arabic numerals.

Yes, but this was all about output.  I am pretty sure TeX was able to
typeset Qur'an in all its glory long before Unicode was invented.
Yet, in machine readable form it would be something like {\quran 1}
(invented directive).  I have asked for a file that is intended for
machine processing, not for human enjoyment in print or on a display.
I claim that if such a file exists, the program that reads it does not
use the same rules as Python, and converting non-ASCII digits would be
a tiny portion of what that program does.


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
Am 02.12.2010 23:43, schrieb M.-A. Lemburg:
 Eric Smith wrote:
 The current behavior should go nowhere; it is not useful. Something very
 similar to the current behavior (but done correctly) should go into the
 locale module.

 I agree with everything Martin says here. I think the basic premise is:
 you won't find strings in the wild that use non-ASCII digits but do
 use the ASCII dot as a decimal point. And that's what float() is looking
 for. (And that doesn't even begin to address what it expects for an
 exponent 'e'.)
 
 http://en.wikipedia.org/wiki/Decimal_mark
 
 In China, comma and space are used to mark digit groups because dot is used 
 as decimal mark.

I may be misinterpreting that, but I think that refers to the case of
writing numbers using Arabic digits.

Chinese digits are, e.g., used in the Suzhou numerals

http://en.wikipedia.org/wiki/Suzhou_numerals

This doesn't have a decimal point at all. Instead, the second line
(below or to the left of the actual digits) describes the power of ten
and the unit of measurement (i.e. similar to scientific notation,
but with ideographs for the powers of ten).

In another writing system, they use 点 (U+70B9) as the decimal
separator, see

http://en.wikipedia.org/wiki/Chinese_numerals#Fractional_values

In the same system, the integral part uses multipliers, i.e.
12345 is [1][10000][2][1000][3][100][4][10][5]; the fractional
part uses regular digits.
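
A rough sketch of reading the integral part in that multiplier
notation. The function name and tables are made up for this example,
and it only handles the simple case where each multiplier follows at
most one digit, i.e. values below 100,000:

```python
DIGITS = {'零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
MULTIPLIERS = {'十': 10, '百': 100, '千': 1000, '万': 10000}


def parse_multiplier_integer(text):
    """Parse e.g. '一万二千三百四十五' as digit/multiplier pairs."""
    total = 0
    pending = 0  # digit waiting for its multiplier
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in MULTIPLIERS:
            # A bare multiplier such as the leading '十' in '十五' counts once.
            total += (pending or 1) * MULTIPLIERS[ch]
            pending = 0
        else:
            raise ValueError('unexpected character: %r' % ch)
    # A trailing digit without a multiplier is the units place.
    return total + pending


print(parse_multiplier_integer('一万二千三百四十五'))  # 12345
```

Nothing like this can be expressed as a per-character digit mapping,
which is why it does not fit into int() or float().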

Regards,
Martin



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Eric Smith

On 12/2/2010 5:43 PM, M.-A. Lemburg wrote:

Eric Smith wrote:

The current behavior should go nowhere; it is not useful. Something very
similar to the current behavior (but done correctly) should go into the
locale module.


I agree with everything Martin says here. I think the basic premise is:
you won't find strings in the wild that use non-ASCII digits but do
use the ASCII dot as a decimal point. And that's what float() is looking
for. (And that doesn't even begin to address what it expects for an
exponent 'e'.)


http://en.wikipedia.org/wiki/Decimal_mark

In China, comma and space are used to mark digit groups because dot is used as 
decimal mark.


Is that an ASCII dot? That page doesn't say.


Note that float() can also parse integers, it just returns them as
floats :-)


:)




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Martin v. Löwis
 The point is that we support all of Unicode in Python, not just a fragment,
 and therefore the numeric constructors support all of Unicode.

That conclusion is as false today as it was in Python 1.6, but only now
are people starting to care about it.

a) we don't support all of Unicode in numeric constructors. There are
   lots of things that you can write down that readers would recognize
   as a real/rational/integral number that float() won't parse.
b) if float() would restrict itself to the scientific notation of
   real numbers (as it should), Python could well continue to claim all
   of Unicode.
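
A few examples for point a), as a sketch assuming Python 3. The sample
characters and the float_accepts helper are chosen for this
illustration only:

```python
import unicodedata

samples = {
    'vulgar fraction one half': '\u00bd',
    'roman numeral four': '\u2163',
    'superscript two': '\u00b2',
}


def float_accepts(text):
    """True if float(text) succeeds, False on ValueError."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# Each character has a numeric value in the Unicode database,
# yet float() rejects all of them.
for name, ch in samples.items():
    print(name, unicodedata.numeric(ch), float_accepts(ch))
```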

 Adding more locale aware numeric parsers and formatters to the
 locale module, based on these APIs is certainly a good idea,
 but orthogonal to the ongoing discussion, IMO.

Not at all. The concept of Unicode numbers is flawed: Unicode does
*not* prescribe any specific way to denote numbers. Unicode is about
characters, and Python supports the Unicode characters for digits as
well as it supports all the other Unicode characters.

Instead, support for non-scientific notation of real numbers should
be based on user needs, which probably can be approximated by looking
at actual scripts. This, in turn, is inherently locale-dependent.

Regards,
Martin



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Eric Smith wrote:
 On 12/2/2010 5:43 PM, M.-A. Lemburg wrote:
 Eric Smith wrote:
 The current behavior should go nowhere; it is not useful. Something
 very
 similar to the current behavior (but done correctly) should go into the
 locale module.

 I agree with everything Martin says here. I think the basic premise is:
 you won't find strings in the wild that use non-ASCII digits but do
 use the ASCII dot as a decimal point. And that's what float() is looking
 for. (And that doesn't even begin to address what it expects for an
 exponent 'e'.)

 http://en.wikipedia.org/wiki/Decimal_mark

 In China, comma and space are used to mark digit groups because dot
 is used as decimal mark.
 
 Is that an ASCII dot? That page doesn't say.

Yes, but to be fair: I think that the page actually refers to the
use of the Arabic numeral format in China, rather than with their
own script symbols.

 Note that float() can also parse integers, it just returns them as
 floats :-)
 
 :)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote:
..
 Some examples:

 http://www.bdl.gov.lb/circ/intpdf/int123.pdf

I looked at this one more closely.  While I cannot understand what it
says, it appears that Arabic numerals are used in dates.   It looks
like Python won't be able to deal with those:

>>> datetime.strptime('١٩٩٩/١٠/٢٩', '%Y/%m/%d')
...
ValueError: time data '١٩٩٩/١٠/٢٩' does not match format '%Y/%m/%d'

Interestingly,

>>> datetime.strptime('١٩٩٩', '%Y')
datetime.datetime(1999, 1, 1, 0, 0)

which further suggests that support of such numerals is accidental.

As I think more about it, though, I am becoming less averse to accepting
these numerals for base 10 integers.  Integers can be easily extracted
from text using simple regex and '\d' accepts all category Nd
characters.  I would require though that all digits be from the same
block, which is not hard because Unicode now promises to only have
them in contiguous blocks of 10.   This rule seems to address some of
the security issues because it is unlikely that a system that can display
some of the local digits would not be able to display all of them
properly.

I still don't think it makes any sense to accept them in float().
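
The regex use case can be sketched as follows, assuming Python 3, where
'\d' in a str pattern matches every category Nd character unless
re.ASCII is given. The sample text is invented:

```python
import re

# Arabic-Indic 1999, ASCII 42, and a run mixing ASCII '1' with
# ARABIC-INDIC DIGIT TWO.
text = 'year \u0661\u0669\u0669\u0669, page 42, mixed 1\u0662'

found = re.findall(r'\d+', text)
values = [int(s) for s in found]
print(values)      # [1999, 42, 12]

# With re.ASCII only the European digits match, so the mixed run
# is cut short after the '1'.
ascii_only = re.findall(r'\d+', text, flags=re.ASCII)
print(ascii_only)  # ['42', '1']
```

The last item of the first result is the mixed-block case that the
same-block rule would reject.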


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Steven D'Aprano

Stephen J. Turnbull wrote:

Steven D'Aprano writes:

  With full respect to haiyang kang, hear-say from one person can hardly 
  be described as strong evidence


That's *disrespectful* nonsense.  What Haiyang reported was not
hearsay, it's direct observation of what he sees around him and
personal experience, plus extrapolation.  Look up hearsay, please.


Fair enough. I chose my words poorly and apologise. A better
description would be anecdotal evidence.



--
Steven


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Terry Reedy

On 12/2/2010 6:54 PM, Alexander Belopolsky wrote:

On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburgm...@egenix.com  wrote:
..

Some examples:

http://www.bdl.gov.lb/circ/intpdf/int123.pdf


I looked at this one more closely.  While I cannot understand what it
says, it appears that Arabic numerals are used in dates.   It looks
like Python won't be able to deal with those:


When I travelled in S. Asia around 25 years ago, Arabic and Indic
numerals were in obvious use in stores, road signs, and banks (as with
money exchange receipts). I learned the digits partly for
self-protection ;-). I have no real idea of what is done *now* in
computerized business, but I assume the native digits are used.


It may well be that there is no Python software yet that operates with 
native digits. The lack of direct output capability would hinder that. 
Of course, someone could run both input and output through 
language-specific str.translate digit translators.
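
Such translators could look roughly like this (Python 3, Arabic-Indic
digits only; the names read_int and write_int are invented for the
sketch):

```python
ASCII_DIGITS = '0123456789'
# U+0660..U+0669, ARABIC-INDIC DIGIT ZERO..NINE
ARABIC_INDIC_DIGITS = ''.join(chr(0x0660 + i) for i in range(10))

TO_NATIVE = str.maketrans(ASCII_DIGITS, ARABIC_INDIC_DIGITS)
TO_ASCII = str.maketrans(ARABIC_INDIC_DIGITS, ASCII_DIGITS)


def read_int(text):
    """Parse an integer written with Arabic-Indic or ASCII digits."""
    return int(text.translate(TO_ASCII))


def write_int(value):
    """Format an integer using Arabic-Indic digits."""
    return str(value).translate(TO_NATIVE)


native = write_int(1999)
print(read_int(native))  # 1999
```

The output direction is the part that the numeric constructors do not
cover at all.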



datetime.strptime('١٩٩٩/١٠/٢٩', '%Y/%m/%d')


Googling ١٩٩٩ gets about 83,000 hits.

..
ValueError: time data '١٩٩٩/١٠/٢٩' does not match format '%Y/%m/%d'

Interestingly,


datetime.strptime('١٩٩٩', '%Y')

datetime.datetime(1999, 1, 1, 0, 0)

which further suggests that support of such numerals is accidental.

As I think more about it, though, I am becoming less averse to accepting
these numerals for base 10 integers.


Both input and output are needed for educational programming, though 
translation tables might be enough.


  Integers can be easily extracted

from text using simple regex and '\d' accepts all category Nd
characters.  I would require though that all digits be from the same
block, which is not hard because Unicode now promises to only have
them in contiguous blocks of 10.


That seems sensible.

 This rule seems to address some of

security issues because it is unlikely that a system that can display
some of the local digits would not be able to display all of them
properly.

I still don't think it makes any sense to accept them in float().


For the present, I would pretty well agree with that, at least until we 
know more.


You have raised an important issue. It is a bit of a chicken and egg 
problem though. We will not really know what is needed until Python is 
used more in non-english/non-euro contexts, while such usage may await 
better support.


--
Terry Jan Reedy




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Stephen J. Turnbull
Lennart Regebro writes:
  2010/12/2 Stephen J. Turnbull step...@xemacs.org:

   T1000 = float('一.◯◯◯')
  
  That was already discussed here, and it's clear that unicode does not
  consider these characters to be something you can use in a decimal
  number, and hence it's not broken.

Huh?  IOW, use Unicode features just because they're there, what the
users want and use doesn't matter?

The only evidence I've seen so far that this feature is anything but a
toy for a small faction of developers is Neil Hodgson's information
that OOo will generate these kinds of digits (note that it *will* do
Han! so the evidence is as good for users demanding Han numerals as
for any other kind, Unicode.org definitions notwithstanding), and that
DOS CP 864 contains the Indo/Arabic versions.

Of course, it's quite possible that those were toys for the developers
of those software packages too.




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread haiyang kang
 Furthermore, data can well originate from texts that were written
 hundreds or even thousands of years ago, so there is plenty of
 material available for processing.

Hmm... for this, I think we need a specially tuned language
processing system, with one subsystem per language :)...
(sometimes a single word is not enough, we also need context)

Take pi for example: in modern math it is written as 3.1415...;
 in old China, it is sometimes written as: 三一四一五 or
 三点一四一五 or 叁点壹肆壹伍;

And if these texts are extracted through a scanner
 (OCR or other image processing tech), in my POV
it is the job of this image processing subsystem
 (or some other subsystem between the image processing and the database)
to do the mapping between number and raw text data. Example table in DB:
text   | raw data   | raw image data
-------|------------|---------------
3.1415 | 三一四一五 | image...
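
A sketch of what such a mapping step might look like for the positional
forms above, in Python 3. The table and function name are invented for
illustration; the multiplier style of writing numbers is not handled:

```python
from decimal import Decimal

COMMON_DIGITS = '〇一二三四五六七八九'
FINANCIAL_DIGITS = '零壹贰叁肆伍陆柒捌玖'

# Map both digit styles to ASCII and '点' to the decimal point.
TABLE = {ord(ch): str(value)
         for forms in (COMMON_DIGITS, FINANCIAL_DIGITS)
         for value, ch in enumerate(forms)}
TABLE[ord('点')] = '.'


def positional_to_decimal(raw):
    """Convert positional Han numerals such as '三点一四一五'."""
    return Decimal(raw.translate(TABLE))


print(positional_to_decimal('三点一四一五'))  # 3.1415
print(positional_to_decimal('叁点壹肆壹伍'))  # 3.1415
print(positional_to_decimal('三一四一五'))    # 31415
```

The last case shows why context is needed: without the '点' the raw
text alone does not say where the fractional part starts.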

br,
khy


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Stephen J. Turnbull
Neil Hodgson writes:

 While I don't have Excel to test with, OpenOffice.org Calc will
  display in Arabic or Han numerals using the NatNum format codes.

Display is different from input, but at least this is concrete
evidence.

Will it accept Arabic on input?  (Han might be too much to ask for
since Unicode considers Han digits to be impure.)

   Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does
   not contain the Arabic digits that have been presented here
   earlier AFAICT.
  
 DOS code page 864 does use 0xB0-0xB9

OK, Microsoft thought it would be useful.

I'd still like to know whether people actually use them for input (or
output, for that matter -- anybody have a corpus of Arabic Form 10-Ks
to grep through?), but that's more concrete evidence than we've seen
before.  Thank you!
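
For what it's worth, the code page 864 range can be checked with the
cp864 codec that ships with Python. This is only a sketch (Python 3,
made-up byte string), and it says nothing about how such data was used
in practice:

```python
# Bytes 0xB0..0xB9 in DOS code page 864 are the Arabic-Indic digits.
raw = bytes([0xB1, 0xB9, 0xB9, 0xB9])

text = raw.decode('cp864')
print([hex(ord(ch)) for ch in text])

# Once decoded, int() treats them like any other category Nd digits.
value = int(text)
print(value)
```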



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Stephen J. Turnbull
Antoine Pitrou writes:

  The legacy format argument looks like a red herring to me. When
  converting from one format to another it is the programmer's job to
  do his/her job right.

Uhmm, the argument *for* this feature proposed by several people
is that Python's numeric constructors do it (right) so that the
programmer doesn't have to.

If Python *doesn't* do it right, why should Python do it at all?



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread Alexander Belopolsky
On Thu, Dec 2, 2010 at 4:57 PM, Mark Dickinson dicki...@gmail.com wrote:
..
 (the decimal spec requires that non-European digits be accepted).

Mark,

I think *requires* is too strong of a word to describe what the spec
says.   The decimal module documentation refers to two authorities:

1. IBM’s General Decimal Arithmetic Specification
2. IEEE standard 854-1987

The IEEE standard predates Unicode and unsurprisingly does not have
anything related to the issue.  The IBM spec says the following in
the Conversions section:


It is recommended that implementations also provide additional number
formatting routines (including some which are locale-dependent), and
if available should accept non-European decimal digits in strings.
 http://speleotrove.com/decimal/daconvs.html

This cannot possibly be interpreted as normative text.  The emphasis
is clearly on formatting routines with non-European decimal digits
added as an afterthought.  This recommendation can reasonably be
interpreted as a requirement that conversion routines should accept
what formatting routines can produce.  In Python there are no
formatting routines to produce non-European numerals, so there is no
requirement to accept them in conversions.

I don't think the decimal module should support non-European decimal
digits.  The only place where it can make some sense is in int()
because here we have a fighting chance of producing a reasonable
definition.   The motivating use case is conversion of numerical data
extracted from text using simple '\d+'  regex matches.

Here is how I would do it:

1.  String x of non-European decimal digits is only accepted in
int(x), but not by int(x, 0) or int(x, 10).
2.  If x contains one or more non-European digits, then

(a)  all digits must be from the same block:

    import unicodedata

    def basepoint(c):
        # code point of the zero digit in c's block
        return ord(c) - unicodedata.digit(c)

    all(basepoint(c) == basepoint(x[0]) for c in x)  # -> True

(b)  and a '+' or '-' sign is not allowed.

3. A character c is a digit if it matches the '\d' regex.  I think this
means unicodedata.category(c) == 'Nd'.

Condition 2(b) is important because there is no clear way to define
what is acceptable as '+' or '-' using Unicode character properties
and not all number systems even support local form of negation.  (It
is also YAGNI.)
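
Put together, rules 2 and 3 might be sketched like this in Python 3.
strict_int is a hypothetical name, not an existing or proposed API, and
rule 1 is not modelled because the function takes no base argument:

```python
import unicodedata


def basepoint(c):
    """Code point of the zero digit in c's run of ten decimal digits."""
    return ord(c) - unicodedata.digit(c)


def strict_int(x):
    """int(x), restricted as described when non-European digits appear."""
    has_non_european = any(
        ord(c) > 127 and unicodedata.category(c) == 'Nd' for c in x
    )
    if has_non_european:
        # Rules 2(b) and 3: only category Nd characters, hence no sign.
        if not all(unicodedata.category(c) == 'Nd' for c in x):
            raise ValueError('only digits are allowed: %r' % x)
        # Rule 2(a): every digit must share the same zero code point.
        if len({basepoint(c) for c in x}) != 1:
            raise ValueError('digits from mixed blocks: %r' % x)
    return int(x)


print(strict_int('\u0661\u0662\u0663'))  # 123
print(strict_int('-42'))                 # -42
```

As written, the sketch is also stricter than int() about surrounding
whitespace whenever non-European digits are present.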


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:
 
 My general answers to the questions you have raised are as follows:
 
 1. Each new feature release should use the latest version of the UCD as
 of the first beta release (or perhaps a week or so before). New chars
 are new features and the beta period can be used to (hopefully) iron out
 any bugs introduced by a new UCD version.

The UCD is versioned just like Python is, so if the Unicode Consortium
decides to ship a 5.2.1 version of the UCD, we can add that to Python 2.7.x,
since Python 2.7 started out with 5.2.0.

 2. The language specification should not be UCD version specific. Martin
 pointed out that the definition of identifiers was intentionally written
 to not be, by referring to 'current version' or some such. On the other
 hand, the UCD version used should be programmatically discoverable,
 perhaps as an attribute of sys or str.

It already is and has been for a while, e.g.

Python 2.5:
>>> import unicodedata
>>> unicodedata.unidata_version
'4.1.0'

 3. The UCD should not change in bugfix releases. New chars are new
 features. Adding them in bugfix releases will introduce gratuitous
 incompatibilities between releases. People who want the latest Unicode
 should either upgrade to the latest Python version or patch an older
 version (but not expect core support for any problems that creates).

See above. Patch level revisions of the UCD are fine for patch level
releases of Python, since those patch level revisions of the UCD fix
bugs just like we do in Python.

Note that each new UCD major.minor version is a new standard on its
own, so it's perfectly ok to stick with one such standard version
per Python version.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Am 30.11.2010 21:24, schrieb Ben Finney:
 haiyang kang corn...@gmail.com writes:

   I think it is a little ugly to have code like this: num =
 float("一.一"), expected result is: num = 1.1

 That's a straw man, though. The string need not be a literal in the
 program; it can be input to the program.

 num = float(input_from_the_external_world)

 Does that change your assessment of whether non-ASCII digits are used?
 
 I think the OP (haiyang kang) already indicated that he finds it quite
 unlikely that anybody would possibly want to enter that. You would need
 a number of key strokes to enter each individual ideograph, plus you
 have to press the keys for keyboard layout switching to enter the Latin
 decimal separator (which you normally wouldn't use along with the Han
 numerals).

That's a somewhat limited view, IMHO. Numbers are not always entered
using a computer keyboard; you have tools like cash registers, special
numeric keypads, scanners, OCR, etc. for external entry, and you also
have other programs producing such output, e.g. MS Office if configured
that way.

The argument with the decimal point doesn't work well either, since
it's obvious that float() and int() do not support localized input.

E.g. in Germany we write 3,141 instead of 3.141:

>>> float('3,141')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 3,141

No surprise there. The localization of the input data, e.g. removal
of thousands separators and conversion of decimal marks to the dot,
has to be done by the application, just as you have to do now for
German floating point number literals.

The locale module already has locale.atof() and locale.atoi() for
just this purpose.
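
Since locale.atof() depends on which locales are installed on the
system, here is a locale-independent sketch of the same normalization
done by hand. parse_localized_float and its parameters are hypothetical
and default to the German conventions:

```python
def parse_localized_float(text, decimal_mark=',', group_mark='.'):
    """Strip digit group marks, then turn the decimal mark into a dot."""
    cleaned = text.replace(group_mark, '').replace(decimal_mark, '.')
    return float(cleaned)


print(parse_localized_float('3,141'))         # 3.141
print(parse_localized_float('1.234.567,89'))  # 1234567.89
```

The order of the two replacements matters: the group marks have to go
before the decimal mark is turned into a dot.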

FYI, here's a list of decimal digits supported by Python 2.7:

http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt:

0030..0039; Decimal # Nd  [10] DIGIT ZERO..DIGIT NINE
0660..0669; Decimal # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
06F0..06F9; Decimal # Nd  [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE
07C0..07C9; Decimal # Nd  [10] NKO DIGIT ZERO..NKO DIGIT NINE
0966..096F; Decimal # Nd  [10] DEVANAGARI DIGIT ZERO..DEVANAGARI DIGIT NINE
09E6..09EF; Decimal # Nd  [10] BENGALI DIGIT ZERO..BENGALI DIGIT NINE
0A66..0A6F; Decimal # Nd  [10] GURMUKHI DIGIT ZERO..GURMUKHI DIGIT NINE
0AE6..0AEF; Decimal # Nd  [10] GUJARATI DIGIT ZERO..GUJARATI DIGIT NINE
0B66..0B6F; Decimal # Nd  [10] ORIYA DIGIT ZERO..ORIYA DIGIT NINE
0BE6..0BEF; Decimal # Nd  [10] TAMIL DIGIT ZERO..TAMIL DIGIT NINE
0C66..0C6F; Decimal # Nd  [10] TELUGU DIGIT ZERO..TELUGU DIGIT NINE
0CE6..0CEF; Decimal # Nd  [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
0D66..0D6F; Decimal # Nd  [10] MALAYALAM DIGIT ZERO..MALAYALAM DIGIT NINE
0E50..0E59; Decimal # Nd  [10] THAI DIGIT ZERO..THAI DIGIT NINE
0ED0..0ED9; Decimal # Nd  [10] LAO DIGIT ZERO..LAO DIGIT NINE
0F20..0F29; Decimal # Nd  [10] TIBETAN DIGIT ZERO..TIBETAN DIGIT NINE
1040..1049; Decimal # Nd  [10] MYANMAR DIGIT ZERO..MYANMAR DIGIT NINE
1090..1099; Decimal # Nd  [10] MYANMAR SHAN DIGIT ZERO..MYANMAR SHAN DIGIT NINE
17E0..17E9; Decimal # Nd  [10] KHMER DIGIT ZERO..KHMER DIGIT NINE
1810..1819; Decimal # Nd  [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1946..194F; Decimal # Nd  [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE
19D0..19DA; Decimal # Nd  [11] NEW TAI LUE DIGIT ZERO..NEW TAI LUE THAM DIGIT ONE
1A80..1A89; Decimal # Nd  [10] TAI THAM HORA DIGIT ZERO..TAI THAM HORA DIGIT NINE
1A90..1A99; Decimal # Nd  [10] TAI THAM THAM DIGIT ZERO..TAI THAM THAM DIGIT NINE
1B50..1B59; Decimal # Nd  [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1BB0..1BB9; Decimal # Nd  [10] SUNDANESE DIGIT ZERO..SUNDANESE DIGIT NINE
1C40..1C49; Decimal # Nd  [10] LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE
1C50..1C59; Decimal # Nd  [10] OL CHIKI DIGIT ZERO..OL CHIKI DIGIT NINE
A620..A629; Decimal # Nd  [10] VAI DIGIT ZERO..VAI DIGIT NINE
A8D0..A8D9; Decimal # Nd  [10] SAURASHTRA DIGIT ZERO..SAURASHTRA DIGIT NINE
A900..A909; Decimal # Nd  [10] KAYAH LI DIGIT ZERO..KAYAH LI DIGIT NINE
A9D0..A9D9; Decimal # Nd  [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
AA50..AA59; Decimal # Nd  [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
ABF0..ABF9; Decimal # Nd  [10] MEETEI MAYEK DIGIT ZERO..MEETEI MAYEK DIGIT NINE
FF10..FF19; Decimal # Nd  [10] FULLWIDTH DIGIT ZERO..FULLWIDTH DIGIT NINE
104A0..104A9  ; Decimal # Nd  [10] OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
1D7CE..1D7FF  ; Decimal # Nd  [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE


The Chinese and Japanese ideographs are not supported because of the
way they are defined in the Unihan database. I'm currently
investigating how we could support them as well.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  

Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
 
 I see no reason not to make a similar promise for numeric literals.  I
 see no good reason to allow compatibility full-width Japanese ASCII
 numerals or Arabic cursive numerals in "for i in range(...)" for
 example.
 
 I do not think that anyone, at least not me, has argued for anything
 other than 0-9 digits (or 0-f for hex) in literals in program code. The
 only issue is whether non-programmer *users* should be able to use their
 native digits in applications in response to input prompts.

Me neither. This is solely about Python being able to parse numeric
input in the float(), int() and complex() constructors.
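
For reference, a minimal sketch of the behaviour under discussion, assuming a current CPython 3 interpreter. The variable names are illustrative only, and the digits are written as escapes so that they survive any mailer:

```python
# Arabic-Indic digits, spelled as escapes to avoid mail/terminal munging.
ARABIC_1234 = "\u0661\u0662\u0663\u0664"          # the digits 1, 2, 3, 4
ARABIC_1234_56 = ARABIC_1234 + ".\u0665\u0666"    # same, followed by ".56"

# The constructors accept any character with the Unicode Decimal property,
# but still require the ASCII "." as decimal point and ASCII "+", "-", "j".
as_int = int(ARABIC_1234)
as_float = float(ARABIC_1234_56)
as_complex = complex(ARABIC_1234 + "+\u0662j")

print(as_int, as_float, as_complex)
```

Nothing here touches the grammar of Python source code; only the string arguments of the constructors are involved.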

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 01 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Steven D'Aprano

Martin v. Löwis wrote:

Am 30.11.2010 23:43, schrieb Terry Reedy:

On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:


I see no reason not to make a similar promise for numeric literals.  I
see no good reason to allow compatibility full-width Japanese ASCII
numerals or Arabic cursive numerals in "for i in range(...)" for
example.

I do not think that anyone, at least not me, has argued for anything
other than 0-9 digits (or 0-f for hex) in literals in program code. The
only issue is whether non-programmer *users* should be able to use their
native digits in applications in response to input prompts.


And here, my observation stands: if they wanted to, they currently
couldn't - at least not for real numbers (and also not for integers
if they want to use grouping). So the presumed application of this
feature doesn't actually work, despite the presence of the feature it
was supposedly meant to enable.


By that argument, English speakers wanting to enter integers using 
Arabic numerals can't either! I'd like to use grouping for large 
literals, if only I could think of a half-decent syntax, and if only 
Python supported it. This fails on both counts:


x = 123_456_789_012_345

The lack of grouping and the lack of a native decimal point doesn't mean 
that the feature doesn't work -- it merely means the feature requires 
some compromise before it can be used.


In the same way, if I wanted to enter a number using non-Arabic digits, 
it works provided I compromise by using the Anglo-American decimal point 
instead of the European comma or the native decimal point I might prefer.


The lack of support for non-dot decimal points is arguably a bug that 
should be fixed, not a reason to remove functionality.
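
To make the "compromise" concrete, here is a hedged sketch of what an application has to do today if it wants to accept the Arabic decimal and thousands separators. The helper parse_arabic_float and the translation table are hypothetical names invented for this example; they are not part of Python:

```python
# Hypothetical helper, not part of Python: map the Arabic decimal separator
# (U+066B) to "." and drop the Arabic thousands separator (U+066C) before
# handing the text to float().
ARABIC_SEPARATORS = str.maketrans({"\u066b": ".", "\u066c": None})


def parse_arabic_float(text):
    return float(text.translate(ARABIC_SEPARATORS))


# "1,234.56" written with Arabic-Indic digits and Arabic separators.
native = "\u0661\u066c\u0662\u0663\u0664\u066b\u0665\u0666"

try:
    float(native)
    accepted_directly = True
except ValueError:
    # float() understands the digits but not the native separators.
    accepted_directly = False

print(accepted_directly, parse_arabic_float(native))
```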



--
Steven



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Lennart Regebro
On Tue, Nov 30, 2010 at 09:23, Stephen J. Turnbull step...@xemacs.org wrote:
 Sure you can.  In Python program text, all keywords will be ASCII

Yes, yes, sure, but not the contents of variables,

 I see no reason not to make a similar promise for numeric literals.

Wait, what, literals? The example was

>>> float('١٢٣٤.٥٦')

which doesn't have any numeric literals in it at all. Do non-ASCII literals work?
Nope, that's a syntax error. Too bad, that would have been cool, but whatever.

Why would this be a problem:

>>> T1234 = float('١٢٣٤.٥٦')
>>> T1234
1234.56

But this OK?

>>> T١٢٣٤ = float('1234.56')
>>> T١٢٣٤
1234.56

I don't see that.


Should we bother to implement ١٢٣٤.٥٦ as a literal equivalent to
1234.56? Well, not unless somebody asks for it, or it turns out to be
easy. :-) But that's another question.
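
The three cases above can be checked mechanically. This is only a sketch, assuming a current CPython 3; the source strings are built from escapes, and the file name "<example>" is arbitrary:

```python
# Source text using Arabic-Indic digits, first as a literal, then in a name.
source_literal = "x = \u0661\u0662\u0663\u0664"
source_identifier = "T\u0661\u0662\u0663\u0664 = float('1234.56')"

# A non-ASCII digit is not allowed to start a numeric literal.
try:
    compile(source_literal, "<example>", "exec")
    literal_ok = True
except SyntaxError:
    literal_ok = False

# The same digits are fine as continuation characters of an identifier.
namespace = {}
exec(source_identifier, namespace)
value = namespace["T\u0661\u0662\u0663\u0664"]

print(literal_ok, value)
```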


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Alexander Belopolsky
On Sun, Nov 28, 2010 at 5:48 PM, M.-A. Lemburg m...@egenix.com wrote:
..
 With Python 3.1:

 >>> exec('\u0CF1 = 1')
 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    ೱ = 1
      ^
 SyntaxError: invalid character in identifier

 but with Python 3.2a4:

 >>> exec('\u0CF1 = 1')
 >>> eval('\u0CF1')
 1

 Such changes are not new, but I agree that they should probably
 be highlighted in the What's new in Python x.x.


As of today, What’s New In Python 3.2 [1] does not even mention the
unicodedata upgrade to 6.0.0.  Here are the features from the
unicode.org summary [2] that I think should be reflected in Python's
What's New document:


* adds 2,088 characters, including over 1,000 additional symbols—chief
among them the additional emoji symbols, which are especially
important for mobile phones;

* corrects character properties for existing characters including
 - a general category change to two Kannada characters (U+0CF1,
U+0CF2), which has the effect of making them newly eligible for
inclusion in identifiers;

 - a general category change to one New Tai Lue numeric character
(U+19DA), which would have the effect of disqualifying it from
inclusion in identifiers unless grandfathering measures are in place
for the defining identifier syntax.
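
A quick way to see how a given interpreter classifies the characters named in the list above. This is a sketch that assumes an interpreter whose unicodedata is at Unicode 6.0.0 or later; whether U+19DA is still accepted inside an identifier depends on the grandfathering rules, so the sketch only prints that result rather than asserting it:

```python
import unicodedata

# The two Kannada signs and the New Tai Lue character from the summary.
CHARACTERS = ("\u0cf1", "\u0cf2", "\u19da")

report = {}
for ch in CHARACTERS:
    report[ch] = {
        "category": unicodedata.category(ch),
        "starts_identifier": ch.isidentifier(),
        "continues_identifier": ("a" + ch).isidentifier(),
    }

print("Unicode database version:", unicodedata.unidata_version)
for ch, info in report.items():
    print("U+{:04X}".format(ord(ch)), info)
```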


The above may be too verbose for inclusion in What’s New In Python
3.2, but I think we should add a possibly shorter summary with a link
to unicode.org for details.

PS: Yes, I think everyone should know about the Python 3.2 killer
feature: ('\N{CAT FACE WITH WRY SMILE}'!

[1] http://docs.python.org/dev/whatsnew/3.2.html
[2] http://www.unicode.org/versions/Unicode6.0.0/


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Terry Reedy

On 12/1/2010 12:55 PM, Alexander Belopolsky wrote:

On Sun, Nov 28, 2010 at 5:48 PM, M.-A. Lemburg m...@egenix.com  wrote:
..

With Python 3.1:


>>> exec('\u0CF1 = 1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    ೱ = 1
    ^
SyntaxError: invalid character in identifier

but with Python 3.2a4:


>>> exec('\u0CF1 = 1')
>>> eval('\u0CF1')
1


Such changes are not new, but I agree that they should probably
be highlighted in the What's new in Python x.x.



As of today, What’s New In Python 3.2 [1] does not even mention the
unicodedata upgrade to 6.0.0.  Here are the features from the
unicode.org summary [2] that I think should be reflected in Python's
What's New document:


* adds 2,088 characters, including over 1,000 additional symbols—chief
among them the additional emoji symbols, which are especially
important for mobile phones;

* corrects character properties for existing characters including
  - a general category change to two Kannada characters (U+0CF1,
U+0CF2), which has the effect of making them newly eligible for
inclusion in identifiers;

  - a general category change to one New Tai Lue numeric character
(U+19DA), which would have the effect of disqualifying it from
inclusion in identifiers unless grandfathering measures are in place
for the defining identifier syntax.




The above may be too verbose for inclusion in What’s New In Python
3.2,


I think those 11 lines are pretty good. Put them in
('\N{CAT FACE WITH WRY SMILE}'!

Plus give a link to Unicode site (Issue numbers are implicit links).

--
Terry Jan Reedy




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Martin v. Löwis
 And here, my observation stands: if they wanted to, they currently
 couldn't - at least not for real numbers (and also not for integers
 if they want to use grouping). So the presumed application of this
 feature doesn't actually work, despite the presence of the feature it
 was supposedly meant to enable.
 
 By that argument, English speakers wanting to enter integers using
 Arabic numerals can't either!

That's correct, and the key point here for the argument. It's just not
*meant* to support localized number forms, but deliberately constrains
them to a formal grammar which users using it must be aware of in order
to use it.

 I'd like to use grouping for large
 literals, if only I could think of a half-decent syntax, and if only
 Python supported it. This fails on both counts:
 
 x = 123_456_789_012_345

Here you are confusing issues, though: this fragment uses the syntax of
the Python programming language. Whether or not the syntax of the
float() constructor arguments matches that syntax is also a subject of
the debate.
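
The two grammars already differ in both directions, which can be shown with a small comparison. This is a sketch assuming a current CPython 3; ast.literal_eval is used here only as a convenient stand-in for "is this text a valid Python literal", and the helper names are invented for the example:

```python
import ast


def accepted_as_literal(text):
    """True if text parses as a Python literal."""
    try:
        ast.literal_eval(text)
        return True
    except (SyntaxError, ValueError):
        return False


def accepted_by_float(text):
    """True if float() accepts text as its string argument."""
    try:
        float(text)
        return True
    except ValueError:
        return False


samples = [
    "1234.56",                                # both grammars
    "inf",                                    # float() only
    "\u0661\u0662\u0663\u0664.\u0665\u0666",  # Arabic-Indic digits: float() only
    "0x10",                                   # literal only
]

table = {s: (accepted_as_literal(s), accepted_by_float(s)) for s in samples}
for text, (literal, constructor) in table.items():
    print(ascii(text), "literal:", literal, "float():", constructor)
```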

I take it that you speak in favor of the float syntax also being used
for the float() constructor.

 The lack of grouping and the lack of a native decimal point doesn't mean
 that the feature doesn't work -- it merely means the feature requires
 some compromise before it can be used.

No, it means that the Python programming language syntax for floating
point numbers just doesn't take local notation into account *at all*.
This is not a flaw - it just means that this feature is non-existent.

Now, for the float() constructor, some people in this thread have
claimed that it *is* aimed at people who want to enter numbers in their
local spellings. I claim that this feature either doesn't work, or is
absent also.

 In the same way, if I wanted to enter a number using non-Arabic digits,
 it works provided I compromise by using the Anglo-American decimal point
 instead of the European comma or the native decimal point I might prefer.

Why would you want that if what you really wanted could not be
done? There certainly *is* a way to convert strings into floats,
and there would still be a way if that restricted itself to the digits 0..9.
So it can't be the mere desire to convert strings to floats that makes
you ask for non-ASCII digits.

 The lack of support for non-dot decimal points is arguably a bug that
 should be fixed, not a reason to remove functionality.

I keep repeating my two concerns:
a) if that was a feature, it is not specified at all in the
   documentation. In fact, the documentation was recently clarified
   to deny existence of that feature.
b) fixing it will be much more difficult than you apparently think.

Regards,
Martin



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Martin v. Löwis
 I think the OP (haiyang kang) already indicated that he finds it quite
 unlikely that anybody would possibly want to enter that.
 
 Who's talking about *entering* it into the program at a keyboard
 directly, though? Input to a program can come from all kinds of crazy
 sources. Just because it wasn't typed by the person at the keyboard
 using this program doesn't stop it being input to the program.

I think haiyang kang claimed exactly that - it won't ever be input to a
program. I trust him on that - and so should you, unless you have
sufficient experience with the Chinese language and writing system.

 Note that I'm not saying this is common. Nor am I saying it's a
 desirable situation. I'm saying it is a feasible use case, to be
 dismissed only if there is strong evidence that it's not used by
 existing Python code.

And indeed, for the Chinese numerals, we have such strong evidence.

Regards,
Martin



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Martin v. Löwis
 As of today, What’s New In Python 3.2 [1] does not even mention the
 unicodedata upgrade to 6.0.0.

One reason was that I was instructed not to change What's New a few
years ago.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Steven D'Aprano

Martin v. Löwis wrote:

I think the OP (haiyang kang) already indicated that he finds it quite
unlikely that anybody would possibly want to enter that.

Who's talking about *entering* it into the program at a keyboard
directly, though? Input to a program can come from all kinds of crazy
sources. Just because it wasn't typed by the person at the keyboard
using this program doesn't stop it being input to the program.


I think haiyang kang claimed exactly that - it won't ever be input to a
program. I trust him on that - and so should you, unless you have
sufficient experience with the Chinese language and writing system.


Note that I'm not saying this is common. Nor am I saying it's a
desirable situation. I'm saying it is a feasible use case, to be
dismissed only if there is strong evidence that it's not used by
existing Python code.


And indeed, for the Chinese numerals, we have such strong evidence.


With full respect to haiyang kang, hear-say from one person can hardly 
be described as strong evidence -- particularly, as Alexander 
Belopolsky pointed out, the use-case described isn't currently supported 
by Python. Given that what haiyang kang describes *can't* be done, the 
fact that people don't do it is hardly surprising -- nor is it a good 
reason for taking away functionality that does exist.




--
Steven



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Alexander Belopolsky
On Wed, Dec 1, 2010 at 5:36 PM, Martin v. Löwis mar...@v.loewis.de wrote:
..
 Note that I'm not saying this is common. Nor am I saying it's a
 desirable situation. I'm saying it is a feasible use case, to be
 dismissed only if there is strong evidence that it's not used by
 existing Python code.

 And indeed, for the Chinese numerals, we have such strong evidence.


Indeed: in the more than 10 years that Python's int() has accepted
Arabic-Indic numerals, nobody has complained that it *did not* accept Chinese.
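
The asymmetry can be observed directly. This is a sketch assuming a current CPython 3 with its bundled unicodedata; the variable names are illustrative:

```python
import unicodedata

arabic_indic_four = "\u0664"   # ARABIC-INDIC DIGIT FOUR
han_one = "\u4e00"             # the CJK ideograph for "one"

# The Arabic-Indic digit has the Decimal property, so int() accepts it.
decimal_value = unicodedata.decimal(arabic_indic_four)
parsed = int(arabic_indic_four)

# The ideograph has a numeric value but no decimal-digit value.
numeric_value = unicodedata.numeric(han_one)
han_decimal = unicodedata.decimal(han_one, None)

try:
    int(han_one)
    han_accepted = True
except ValueError:
    han_accepted = False

print(decimal_value, parsed, numeric_value, han_decimal, han_accepted)
```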
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Steven D'Aprano

Martin v. Löwis wrote:

And here, my observation stands: if they wanted to, they currently
couldn't - at least not for real numbers (and also not for integers
if they want to use grouping). So the presumed application of this
feature doesn't actually work, despite the presence of the feature it
was supposedly meant to enable.

By that argument, English speakers wanting to enter integers using
Arabic numerals can't either!


That's correct, and the key point here for the argument. It's just not
*meant* to support localized number forms, but deliberately constrains
them to a formal grammar which users using it must be aware of in order
to use it.


You're *agreeing* that English speakers can't enter integers using 
Arabic numerals? What do you think I'm doing when I do this?


>>> int("1234")
1234

Ah wait... did you think I meant Arabic numerals in the sense of digits 
used by Arabs in Arabia? I meant Arabic numerals as opposed to Roman 
numerals. Sorry for the confusion.


Your argument was that even though Python's int() supports many 
non-ASCII digits, the lack of grouping means that it doesn't actually 
work. If that argument were correct, then it applies equally to ASCII 
digits as well.


It's clearly nonsense to say that int("1234") doesn't work just 
because of the lack of grouping. It's equally nonsense to say that
int("١٢٣٤") doesn't work because of the lack of grouping.


[...]

I take it that you speak in favor of the float syntax also being used
for the float() constructor.


I'm sorry, I don't understand what you mean here. I've repeatedly said 
that the syntax for numeric literals should remain constrained to the 
ASCII digits, as it currently is.


n = ١٢٣٤

gives a SyntaxError, and I don't want to see that change.

But I've also argued that, since the float constructor currently accepts 
non-ASCII strings:


n = int("١٢٣٤")

we should continue to support the existing behaviour. None of the 
arguments against it seem convincing to me, particularly since the 
opponents of the current behaviour admit that there is a use-case for 
it, but they just want it to move elsewhere, such as the locale module.


We've even heard from one person -- I forget who, sorry -- who claimed 
that C++ has the same behaviour, and if you want ASCII-only digits, you 
have to explicitly ask for it.


For what it's worth, Microsoft warns developers not to assume users will 
enter numeric data using ASCII digits:


"Number representation can also use non-ASCII native digits, so your 
application may encounter characters other than 0-9 as inputs. Avoid 
filtering on U+0030 through U+0039 to prevent frustration for users who 
are trying to enter data using non-ASCII digits."


http://msdn.microsoft.com/en-us/magazine/cc163506.aspx
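
In Python terms, the difference between the two kinds of filter looks roughly like this. It is a sketch only; ascii_only_digits is a hypothetical helper written for this illustration, and the sample input uses Devanagari digits:

```python
def ascii_only_digits(text):
    """The kind of filter the quoted advice warns against: U+0030..U+0039 only."""
    return text != "" and all("0" <= ch <= "9" for ch in text)


user_input = "\u0967\u0968\u0969"   # DEVANAGARI DIGIT ONE, TWO, THREE

rejected_by_ascii_filter = not ascii_only_digits(user_input)
accepted_by_isdecimal = user_input.isdecimal()
value = int(user_input)

print(rejected_by_ascii_filter, accepted_by_isdecimal, value)
```

str.isdecimal() follows the Unicode Decimal property, which is the same set of characters that int() converts.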


There was a similar discussion going on in Perl-land recently:

http://www.nntp.perl.org/group/perl.perl5.porters/2010/07/msg162400.html

although, being Perl, the discussion was dominated by concerns about 
regexes and implicit conversions, rather than an explicit call to 
float() or int() as we are discussing here.



[...]

In the same way, if I wanted to enter a number using non-Arabic digits,
it works provided I compromise by using the Anglo-American decimal point
instead of the European comma or the native decimal point I might prefer.


Why would you want that if what you really wanted could not be
done? There certainly *is* a way to convert strings into floats,
and there would still be a way if that restricted itself to the digits 0..9.
So it can't be the mere desire to convert strings to floats that makes
you ask for non-ASCII digits.


Why do Europeans use programming languages that force them to use a dot 
instead of a comma for the decimal place? Why do I misspell 
string.centre as string.center? Because if you want to get something 
done, you use the tools you have and not the tools you'd like to have.





--
Steven


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Stephen J. Turnbull
Lennart Regebro writes:
  On Tue, Nov 30, 2010 at 09:23, Stephen J. Turnbull step...@xemacs.org 
  wrote:
   Sure you can.  In Python program text, all keywords will be ASCII
  
  Yes, yes, sure, but not the contents of variables,

Irrelevant, you're not converting these to a string representation.
If you're generating numerals for internal use, I don't see why you
would want to do arithmetic on them; conversion is a YAGNI.  This is
only interesting to allow naive users to input in a comfortable way.

As yet there is no evidence that there are *any* such naive users, 1.3
billion possible users are shut out, and at least two cultures which
use non-ASCII numerals every day, representing 1.3 billion naive users
(the coincidence of numbers is no coincidence), have reported that
nobody in their right mind would *input* the numbers that way,
and at least for Japanese, the use cases are not really numeric anyway.

   I see no reason not to make a similar promise for numeric literals.
  
  Wait what, literals?

Sorry, my bad.

  Why would this be a problem:
  
   T1234 = float('.~~')
   T1234
  1234.56
  
  But this OK?
  
   T = float('1234.56')
   T
  1234.56

(Sorry, the Arabic is going to get munged, my mailer is beta and
somebody screwed up.)

Because the characters in the identifier are uninterpreted and have no
syntactic content other than their identity.  They're arbitrary.
That's not true of numerics.

Because that works, but

print(T1234)

doesn't (it prints ASCII).  You can't round-trip, but users will
want/expect that.
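
The round-trip point can be illustrated as follows. This is a sketch; format_arabic_indic and the translation table are hypothetical names, since Python itself offers no such output mode for str() or print():

```python
# Translation table from ASCII digits to Arabic-Indic digits (U+0660..U+0669).
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
ASCII_TO_ARABIC_INDIC = str.maketrans("0123456789", ARABIC_INDIC_DIGITS)


def format_arabic_indic(value):
    """Hypothetical inverse of float(): render a number with Arabic-Indic digits."""
    return str(value).translate(ASCII_TO_ARABIC_INDIC)


original = "\u0661\u0662\u0663\u0664.\u0665\u0666"   # "1234.56" in Arabic-Indic digits
value = float(original)

printed = str(value)                      # what print() would show: ASCII digits
round_tripped = format_arabic_indic(value)

print(printed, printed == original, round_tripped == original)
```

Parsing is built in, but the way back has to be written by the application.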

Because that works but this doesn't:

T1000 = float('一.◯◯◯')

Violates TOOWTDI.

If you're proposing to fix the numeric parsers, I still don't like it
but I could go to -0 on it.  However as Alexander points out and MAL
admits, it's apparently not so easy to do that.


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Alexander Belopolsky
On Wed, Dec 1, 2010 at 7:17 PM, Steven D'Aprano st...@pearwood.info wrote:
..
 we should continue to support the existing behaviour. None of the arguments
 against it seem convincing to me, particularly since the opponents of the
 current behaviour admit that there is a use-case for it, but they just want
 it to move elsewhere, such as the locale module.


I don't remember who made this argument, but I think you misunderstood
it.  The argument was that if there was a use case for parsing Eastern
Arabic numerals, it would be better served by a module written by
someone who speaks one of the Arabic languages and knows the details
of how  Eastern Arabic numerals are written.  So far nobody has even
claimed to know conclusively that Arabic-Indic digits are always
written left-to-right.

>>> unicodedata.bidirectional('٤')
'AN'

is not very helpful because it means "any Arabic-Indic digit"
according to unicode.org.  (To me, a special category hints that it
may be written in either direction and the proper interpretation may
depend on context.)   I have not seen a real use case reported in this
thread and for theoretical use cases, the current implementation is
either outright wrong or does not solve the problem completely. Given
that a function that replaces all Unicode digits in a string with 0-9
can be written in 3 lines of Python code, it is very unlikely that
anyone would prefer to rely on undocumented behavior of Python
builtins instead of having explicit control over parsing of their
data.
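
One possible shape of such a function, as a sketch assuming only the standard unicodedata module. The name to_ascii_digits is made up for this illustration:

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit in text with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch for ch in text
    )


# Arabic-Indic "1234.56", then fullwidth "12"; everything else passes through.
mixed = "\u0661\u0662\u0663\u0664.\u0665\u0666 / \uff11\uff12"
converted = to_ascii_digits(mixed)
print(converted)
```

With this in hand, an application can decide for itself which scripts and separators to accept before calling int() or float().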


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Stephen J. Turnbull
Steven D'Aprano writes:

  With full respect to haiyang kang, hear-say from one person can hardly 
  be described as strong evidence

That's *disrespectful* nonsense.  What Haiyang reported was not
hearsay, it's direct observation of what he sees around him and
personal experience, plus extrapolation.  Look up hearsay, please.

Furthermore, he provided good *objective* reason (excessive cost, to
which I can also testify, in several different input methods for
Japanese) why numbers simply would not be input that way.

What's left is copy/paste via the mouse.  I assure you, every day I
see dozens of Japanese copy/pasting *only* ASCII numerals, and the
sales figures for Microsoft Excel (not to mention the download numbers
for Open Office) strongly suggest that 30 million Japanese salarymen
are similarly dedicated to ASCII.  (That's not hearsay either,
that's direct observation and extrapolation, which is more than the
"we need float to translate Arabic" supporters can offer.)

I have seen only *one* use case: it's a toy for sophisticated
programmers who want to think of themselves as broadminded.  We've
seen several examples of that in this thread, so I can't deny that is
a real use case.

Please, give us just *one* more real use case that isn't "somebody
might".
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Ben Finney
Stephen J. Turnbull step...@xemacs.org writes:

 Furthermore, he provided good *objective* reason (excessive cost, to
 which I can also testify, in several different input methods for
 Japanese) why numbers simply would not be input that way.

 What's left is copy/paste via the mouse.

For direct entry by an interactive user, yes. Why are some people in
this discussion thinking only of direct entry by an interactive user?

Input to a program comes from various sources other than direct entry by
the interactive user, as has been pointed out many times.

 Please, give us just *one* more real use case that isn't "somebody
 might".

Input from an existing text file, as I said earlier. Or any other way of
text data making its way into a Python program.
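
A sketch of that scenario, with io.StringIO standing in for an existing text file. The file contents are invented for the example and mix Arabic-Indic and ASCII digits:

```python
import io

# Stand-in for an existing text file: "12.5" and "3" in Arabic-Indic digits,
# then "7.25" in ASCII digits, one value per line.
existing_file = io.StringIO("\u0661\u0662.\u0665\n\u0663\n7.25\n")

# float() ignores surrounding whitespace, so the trailing newline is harmless.
values = [float(line) for line in existing_file if line.strip()]
total = sum(values)

print(values, total)
```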

Direct entry at the console is a red herring.

-- 
 \   “First things first, but not necessarily in that order.” —The |
  `\  Doctor, _Doctor Who_ |
_o__)  |
Ben Finney



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Terry Reedy

On 12/1/2010 7:44 PM, Alexander Belopolsky wrote:


it.  The argument was that if there was a use case for parsing Eastern
Arabic numerals, it would be better served by a module written by
someone who speaks one of the Arabic languages and knows the details
of how  Eastern Arabic numerals are written.  So far nobody has even
claimed to know conclusively that Arabic-Indic digits are always
written left-to-right.


Both my personal observations when travelling from Turkey to India and 
Wikipedia say yes. "When representing a number in Arabic, the 
lowest-valued position is placed on the right, so the order of positions 
is the same as in left-to-right scripts."

https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals

--
Terry Jan Reedy



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Alexander Belopolsky
On Wed, Dec 1, 2010 at 10:11 PM, Terry Reedy tjre...@udel.edu wrote:
 On 12/1/2010 7:44 PM, Alexander Belopolsky wrote:

 it.  The argument was that if there was a use case for parsing Eastern
 Arabic numerals, it would be better served by a module written by
 someone who speaks one of the Arabic languages and knows the details
 of how  Eastern Arabic numerals are written.  So far nobody has even
 claimed to know conclusively that Arabic-Indic digits are always
 written left-to-right.

 Both my personal observations when travelling from Turkey to India and
 Wikipedia say yes. When representing a number in Arabic, the lowest-valued
 position is placed on the right, so the order of positions is the same as in
 left-to-right scripts.
 https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals

This matches my limited research on this topic as well.  However, I am
not sure that when these codes are embedded in Arabic text, their
logical order always matches their display order.  It seems to me that
it can go either way depending on the surrounding text and/or presence
of explicit formatting codes.  Also, I don't understand why Eastern
Arabic-Indic digits have the same Bidi-Class as European digits, but
Arabic-Indic digits, Arabic decimal and thousands separators have
Bidi-Class AN.

http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types
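
The classes in question can be listed with unicodedata. This is a sketch that assumes "Eastern Arabic-Indic digits" refers to the characters Unicode names EXTENDED ARABIC-INDIC DIGIT ...:

```python
import unicodedata

NAMES = [
    "DIGIT FOUR",
    "ARABIC-INDIC DIGIT FOUR",
    "EXTENDED ARABIC-INDIC DIGIT FOUR",
    "ARABIC DECIMAL SEPARATOR",
    "ARABIC THOUSANDS SEPARATOR",
]

# Map each character name to its Bidi_Class as reported by unicodedata.
bidi_classes = {
    name: unicodedata.bidirectional(unicodedata.lookup(name)) for name in NAMES
}

for name, bidi_class in bidi_classes.items():
    print("{:35} {}".format(name, bidi_class))
```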


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread Stephen J. Turnbull
Ben Finney writes:

  Input from an existing text file, as I said earlier. Or any other way of
  text data making its way into a Python program.

  Direct entry at the console is a red herring.

I don't think it is.  Not at all.  Here's why: '''print "%d" %
some_integer''' doesn't now, and never will (unless Kristan gets his
Python 2.8 <wink>), produce Arabic or Han numerals.  Not in any
language I know of, not in Microsoft Excel, and definitely not in
Python 2.  *Somebody* typed that text at some point.  If it's Han,
that somebody had *way* too much time on his hands, not a working
accountant nor a graduate assistant in a research lab for sure.

How about old archived texts, copied and recopied?  At least for
Japanese, old archival (text) data will *all* be in ASCII, because the
earliest implementations of Japanese language text used JIS X 0201 (or
its predecessor), which doesn't have Han digits (and kana digits don't
exist even if you write with a brush and ink AFAIK).  Ditto Arabic, I
would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the
Arabic digits that have been presented here earlier AFAICT.  Note that
there's plenty of space for them in that code table (eg, 0xB0-0xB9 is
empty).  Apparently nobody *ever* thought it was useful to have them!
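
The ISO 8859-6 part of this can be checked against Python's own codec. This is a sketch assuming the standard iso8859_6 codec; the helper encodable is invented for the example:

```python
ARABIC_LETTER_ALEF = "\u0627"
ARABIC_INDIC_ONE = "\u0661"


def encodable(text, codec="iso8859_6"):
    """True if every character of text exists in the given legacy code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


results = {
    "ASCII digit": encodable("1"),
    "Arabic letter": encodable(ARABIC_LETTER_ALEF),
    "Arabic-Indic digit": encodable(ARABIC_INDIC_ONE),
}
print(results)
```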

So, which culture, using which script and in which application, inputs
numeric data in other than ASCII digits?  Or would want to, if only
somebody would tell them they can do it in Python?  Hearsay will do,
for starters.


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Lennart Regebro
On Sun, Nov 28, 2010 at 21:24, Alexander Belopolsky
alexander.belopol...@gmail.com wrote:
 While we have little choice but to follow UCD in defining
 str.isidentifier(), I think Python can promise users more stability in
 what it treats as space or as a digit in its builtins.

Why? I can see this is a problem if one character that earlier was
allowed no longer is. That breaks backwards compatibility. This
doesn't.

 float('١٢٣٤.٥٦')
 1234.56

 is more important than to assure users that once their program
 accepted some text as a number, they can assume that the text is
 ASCII.

*I* think it is more important. In Python 3, you can never ever assume
anything is ASCII any more. ASCII is practically dead and buried as far
as Python goes, unless you explicitly encode to it.

 def deposit(self, amountstr):
   self.balance += float(amountstr)
   audit_log("Deposited: " + amountstr)

 Auditor:

 $ cat numbered-account.log
 Deposited: ?.??

That log reasonably should be in UTF-8 or something else, in which
case this is not a problem. And that's ignoring that it makes way more
sense to log the numerical amount.
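A minimal sketch of the "log the numerical amount" suggestion, assuming a hypothetical Account class and an in-memory log stream (neither is from the quoted code). The input is parsed once, and the parsed value, not the raw string, is what reaches the log:

```python
import io


class Account:
    """Hypothetical account class; all names here are illustrative only."""

    def __init__(self, log):
        self.balance = 0.0
        self.log = log  # any writable text stream

    def deposit(self, amountstr):
        # float() accepts ASCII digits as well as other Unicode decimal digits.
        amount = float(amountstr)
        self.balance += amount
        # Log the parsed number rather than the raw input, so the log line
        # contains only ASCII digits whatever script the user typed in.
        self.log.write("Deposited: %.2f\n" % amount)
        return amount


log = io.StringIO()
account = Account(log)
# '\u0661\u0662\u0663\u0664.\u0665\u0666' is 1234.56 in Arabic-Indic digits.
account.deposit("\u0661\u0662\u0663\u0664.\u0665\u0666")
print(log.getvalue(), end="")
```

With this arrangement the auditor never sees the original digits, which may or may not be what an audit trail wants; logging both the raw text (in UTF-8) and the parsed value is another option.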

-- 
Lennart Regebro: http://regebro.wordpress.com/
Python 3 Porting: http://python3porting.com/
+33 661 58 14 64


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Hagen Fürstenau
 During PEP 3003 discussion, it was suggested to handle it on a case by
 case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP
 3003.
 
 It's covered by "As the standard library is not directly tied to the
 language definition it is not covered by this moratorium."

How is this restricted to the stdlib if it defines the set of valid
identifiers?

- Hagen



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Stephen J. Turnbull
Lennart Regebro writes:

  *I* think it is more important. In python 3, you can never ever assume
  anything is ASCII any more.

Sure you can.  In Python program text, all keywords will be ASCII
(English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable
future.

I see no reason not to make a similar promise for numeric literals.  I
see no good reason to allow compatibility full-width Japanese ASCII
numerals or Arabic cursive numerals in "for i in range(...)" for
example.

As soon as somebody gives an example of a culture, however minor, that
uses computers but actively prefers to use non-ASCII numerals to
express numbers in an IT context, I'll review my thinking.  But at the
moment it's 101% YAGNI.


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread haiyang kang
hi,

  I agree with this.

  I have never seen anyone in China use Chinese number literals (there
are at least two kinds: 一 and 壹, both meaning 1)
  in a Python program, except for UI output.

  They can apply a mapping when they want to output these non-ASCII numbers.
  Example: if 1: print "一"

  I think it is a little ugly to have code like this: num =
float("一.一"), expected result is: num = 1.1

br,
khy

On Tue, Nov 30, 2010 at 4:23 PM, Stephen J. Turnbull step...@xemacs.org wrote:
 Lennart Regebro writes:

   *I* think it is more important. In python 3, you can never ever assume
   anything is ASCII any more.

 Sure you can.  In Python program text, all keywords will be ASCII
 (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable
 future.

 I see no reason not to make a similar promise for numeric literals.  I
 see no good reason to allow compatibility full-width Japanese ASCII
 numerals or Arabic cursive numerals in "for i in range(...)" for
 example.

 As soon as somebody gives an example of a culture, however minor, that
 uses computers but actively prefers to use non-ASCII numerals to
 express numbers in an IT context, I'll review my thinking.  But at the
 moment it's 101% YAGNI.



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Steven D'Aprano

haiyang kang wrote:

hi,

  I agree with this.

  I have never seen anyone in China use Chinese number literals (there
are at least two kinds: 一 and 壹, both meaning 1)
  in a Python program, except for UI output.

  They can apply a mapping when they want to output these non-ASCII numbers.
  Example: if 1: print "一"

  I think it is a little ugly to have code like this: num =
float("一.一"), expected result is: num = 1.1


I don't expect that anyone would sensibly write code like that, except
for testing. You wouldn't write num = float("1.1") instead of just

num = 1.1 either.

But you should be able to write:

text = input("Enter a number using your preferred digits: ")
num = float(text)

without caring whether the user enters 一.一 or 1.1 or something else.


--
Steven


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Steven D'Aprano

Stephen J. Turnbull wrote:

Lennart Regebro writes:

  *I* think it is more important. In python 3, you can never ever assume
  anything is ASCII any more.

Sure you can.  In Python program text, all keywords will be ASCII
(English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable
future.

I see no reason not to make a similar promise for numeric literals.  I
see no good reason to allow compatibility full-width Japanese ASCII
numerals or Arabic cursive numerals in "for i in range(...)" for
example.


I agree with you that numeric *literals* should be restricted to the 
ASCII digits. I don't think anyone here is arguing differently -- if 
they are, they should speak up and try to make the case for allowing 
numeric literals in arbitrary scripts. Python doesn't currently allow 
non-ASCII numeric literals, and even if such a change were desirable, it 
would run up against the moratorium. So let's just forget the specter of 
code like:


x = math.sqrt(١٢٣٤.٥٦ ** 一.一)

It ain't gonna happen :)
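The distinction between literals and strings can be checked directly. This is a small sketch, assuming a current CPython 3; the helper name is_valid_source is made up for the illustration. It compiles a snippet of source text and reports whether the compiler accepts it:

```python
def is_valid_source(src):
    """Return True if src compiles as Python source, False on SyntaxError."""
    try:
        compile(src, "<demo>", "exec")
    except SyntaxError:
        return False
    return True


# U+0661 U+0662 U+0663 are the Arabic-Indic digits one, two, three.
arabic_123 = "\u0661\u0662\u0663"

print(is_valid_source("x = 123"))            # ASCII literal
print(is_valid_source("x = " + arabic_123))  # non-ASCII "literal"
print(int(arabic_123))                       # same digits, passed as a string
```

The second snippet is rejected by the compiler, while int() accepts the same digits when they arrive as a string.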


But I think there is a good case for allowing the constructors int, 
float and complex to continue to accept numeric *strings* with non-ASCII 
 digits. The code already exists, there's probably people out there who 
rely on it, and in the absence of any convincing demonstration that the 
existing behaviour is causing widespread difficulty, we should leave 
well-enough alone.


Various people have suggested that there should be a function in the 
locale module that handles numeric string input in non-ASCII digits. 
This is a de facto admission that there are use-cases for taking user 
input like the string '٣' and turning it into the int 3. Python can 
already do this, and has been able to for many years:


[st...@sylar ~]$ python2.4
Python 2.4.6 (#1, Mar 30 2009, 10:08:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> int(u'٣')
3

It seems to me that there's no need to move this functionality into locale.


--
Steven



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Antoine Pitrou
On Wed, 01 Dec 2010 00:23:22 +1100
Steven D'Aprano st...@pearwood.info wrote:
 
 But I think there is a good case for allowing the constructors int, 
 float and complex to continue to accept numeric *strings* with non-ASCII 
   digits. The code already exists, there's probably people out there who 
 rely on it, and in the absence of any convincing demonstration that the 
 existing behaviour is causing widespread difficulty, we should leave 
 well-enough alone.

+1

 It seems to me that there's no need to move this functionality into locale.

Not only that, but moving it into locale won't make it easier to maintain
anyway.

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Alexander Belopolsky
On Tue, Nov 30, 2010 at 7:59 AM, Steven D'Aprano st...@pearwood.info wrote:
..
 But you should be able to write:

 text = input("Enter a number using your preferred digits: ")
 num = float(text)

 without caring whether the user enters 一.一 or 1.1 or something else.


I find it ironic that people who argue for preservation of the current
behavior do it without checking what it actually is:

>>> float('一.一')
..
UnicodeEncodeError: 'decimal' codec can't encode character '\u4e00' ..

This is one of the biggest problems with this feature: it does not fit
users' expectations.  Even the original author of the decimal codec
expected the above to work. [1]
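A quick way to see which digit scripts float() really accepts is to probe it. The sketch below assumes a current CPython 3; try_float and the sample table are invented for the illustration. It catches ValueError, which also covers UnicodeEncodeError because that is a ValueError subclass:

```python
def try_float(text):
    """Return float(text), or None if the text is not accepted."""
    try:
        return float(text)
    except ValueError:  # UnicodeEncodeError is a subclass of ValueError
        return None


samples = {
    "ascii": "1.1",
    "arabic-indic": "\u0661.\u0661",  # Nd digits with an ASCII decimal point
    "fullwidth": "\uff11.\uff11",     # Nd digits with an ASCII decimal point
    "han": "\u4e00.\u4e00",           # 一.一: the ideographs are not Nd
}

for label, text in samples.items():
    print(label, try_float(text))
```

Only characters with the decimal digit property (category Nd) are converted; the Han numerals carry a numeric value in the database but are not decimal digits, so they are rejected.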

 Python can already do this, and has been able to for many years:
  int(u'٣')
 3

but you can do this without support from int() as well:

>>> import unicodedata
>>> unicodedata.digit('٣')
3

and for Unihan numbers, you can do
>>> unicodedata.numeric('一')
1.0

and

>>> unicodedata.numeric('ⅷ')
8.0

and if you are so inclined,

>>> [unicodedata.numeric(c) for c in 'ↂ ↁ ⅗ ⅞ ⅸ'.split()]
[10000.0, 5000.0, 0.6, 0.875, 9.0]

Do you want to see all these supported by float()?
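The three lookups used above correspond to three different properties in the database, and a character may have some of them but not others. A small sketch (numeric_properties is a made-up helper name) that returns all three, with None where the property is absent:

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None for missing values."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in ["\u0663", "\u2461", "\u00bd", "x"]:
    # ARABIC-INDIC DIGIT THREE, CIRCLED DIGIT TWO, VULGAR FRACTION ONE HALF, x
    print("U+%04X" % ord(ch), numeric_properties(ch))
```

Only the first of these is a decimal digit in the sense that int() and float() use; the others have a digit or numeric value but are not accepted by the constructors.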

[1] makeunicodedata.py does not support Unihan digit data
http://bugs.python.org/issue10575


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread haiyang kang
 But you should be able to write:

 text = input("Enter a number using your preferred digits: ")
 num = float(text)

 without caring whether the user enters 一.一 or 1.1 or something else.

Yes, from a logical point of view, this can happen.

But I really doubt that there are users who would like to input a
number like that. It would mean that they first use the Google Pinyin
input method to enter 一, then switch to the English input method to
enter ".", then switch back to Google Pinyin for the other 一;
or maybe you mean they enter the whole of 一.一 with the Google Pinyin
input method.

To input 1, users need only a single keystroke, but to input 一 they
need three (y, i, SPACE).

Of course, users can also enter something by accident, but then we just
need to give them a friendly reminder.

At least the coders around me restrict their systems' users to entering
numbers in ASCII, and the users still seem happy with ASCII numbers :).

br,
khy


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Alexander Belopolsky
On Mon, Nov 29, 2010 at 4:13 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 - Should Python documentation refer to the specific version of Unicode
 that it supports?

 You mean, mention it somewhere? Sure (although it would be nice if the
 documentation generator would automatically extract it from the source,
 just as it extracts the Python version number).

 Of course, such mentioning should explain that this is specific to
 CPython, and not an aspect of Python-the-language.

 Current documentation refers to old versions.  Should version be
 updated or removed to imply the latest?

 What specific reference are you referring to?

I found two places: A reference to Unicode 3.0 (!) in the Data Model
section and a reference to 5.2.0 in unicodedata docs.

See http://mail.python.org/pipermail/docs/2010-November/002074.html
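For reference, the version of the database that a given interpreter was built with can be read at run time, which is one way documentation or tests could avoid hardcoding it. A minimal sketch; ucd_version is a made-up helper name:

```python
import unicodedata


def ucd_version(module=unicodedata):
    """Return the UCD version of a unicodedata-like object as a tuple of ints."""
    return tuple(int(part) for part in module.unidata_version.split("."))


print("Current UCD version:", ucd_version())
# unicodedata.ucd_3_2_0 is the frozen 3.2.0 database kept alongside the
# current one.
print("Frozen UCD version:", ucd_version(unicodedata.ucd_3_2_0))
```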

 - How UCD updates should be handled during the language moratorium?

 It's clearly not affected.


This is not what Guido said last year:

 One question:

 There are currently a number of patches waiting on the tracker for
 additional Unicode feature support and it's also likely that we'll
 want to upgrade to a more recent Unicode version within the
 next few years.

 How would such indirect changes be seen under the moratorium ?

That would fall under the "Case-by-Case Exemptions" section. "Within the
next few years" sounds like it might well wait until the moratorium is
ended though. :-)


http://mail.python.org/pipermail/python-dev/2009-November/093666.html

I don't see it as a big deal, but technically speaking, with Unicode
6.0 changing the properties of two characters so that they become valid
in identifiers, the Python language definition is affected.  For
example, an alternative implementation based on 5.2.0 will not accept a
valid CPython program that uses one of these characters.

 During PEP 3003 discussion, it was suggested to handle it on a case by
 case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP
 3003.

 It's covered by "As the standard library is not directly tied to the
 language definition it is not covered by this moratorium."


See above.  Also, it has been suggested that semantics of built-ins
cannot change.  (If that was so, it would put int('١٢٣٤') debate to
rest at least for the time being.:-)

  Should this upgrade be backported to 2.7?

 No, it's a new feature.

Given that 2.7 will be maintained for 5 years and that the Unicode
Consortium arguably takes backward compatibility very seriously,
wouldn't it make sense to consider a backport at some point?

I am sure we will soon see a bug report that the following does not
work in 2.7: :-)
>>> ord('\N{CAT FACE WITH WRY SMILE}')
128572
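Code that has to run on interpreters with different database versions can look a name up at run time instead of using a \N{...} escape, which is resolved at compile time. A sketch under that assumption; code_point_for is a made-up helper name:

```python
import unicodedata


def code_point_for(name):
    """Return the code point for a character name, or None if it is unknown
    to the database this interpreter was built with."""
    try:
        return ord(unicodedata.lookup(name))
    except KeyError:  # raised by lookup() for names missing from this UCD
        return None


print(code_point_for("CAT FACE WITH WRY SMILE"))
print(code_point_for("NO SUCH CHARACTER NAME"))
```

On an interpreter whose database predates 6.0, the first call would return None rather than fail at compile time.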


 - How specific should library reference manual be in defining methods
 affected by UCD such as str.upper()?

 It should specify what this actually does in Unicode terminology
 (probably in addition to a layman's rephrase of that)


I opened an issue for this:

http://bugs.python.org/issue10587

 .. For example, if '\U'.isalpha() returns true
 in one implementation, can it return false in another?

 Implementations are free to use any version of the UCD.

I was more concerned about wide and narrow Unicode CPython builds.  Is
it a bug that '\U'.isalpha() may disagree even when the two
implementations are based on the same version of UCD?


Thanks for your answers.


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Alexander Belopolsky
On Tue, Nov 30, 2010 at 9:56 AM, haiyang kang corn...@gmail.com wrote:
 But you should be able to write:

 text = input("Enter a number using your preferred digits: ")
 num = float(text)

 without caring whether the user enters 一.一 or 1.1 or something else.

 yes. from logical point of view, this can happen. ...

Please stop discussing a non-feature.  Python's float *does not*
accept ' 一.一'.  This was reported as a bug and closed as invalid.

See "makeunicodedata.py does not support Unihan digit data":
http://bugs.python.org/issue10575


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Stefan Krah
Alexander Belopolsky alexander.belopol...@gmail.com wrote:
 On Tue, Nov 30, 2010 at 9:56 AM, haiyang kang corn...@gmail.com wrote:
  But you should be able to write:
 
  text = input("Enter a number using your preferred digits: ")
  num = float(text)
 
  without caring whether the user enters 一.一 or 1.1 or something else.
 
  yes. from logical point of view, this can happen. ...
 
 Please stop discussing a non-feature.  Python's float *does not*
 accept ' 一.一'.  This was reported as a bug and closed as invalid.

That seems irrelevant to me. One of the main topics of this thread is
whether actual native speakers would be happy with ascii-only input for
float().

haiyang kang confirmed that this is the case. I hope that more
local speakers will contribute their views.


Stefan Krah




Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Alexander Belopolsky
On Mon, Nov 29, 2010 at 2:38 PM, Alexander Belopolsky
alexander.belopol...@gmail.com wrote:
..
 Still, if it's not detrimental and it's not difficult to support,
 then why do you care?

 It is difficult to support.  A fix for issue10557 would be much
 simpler if we did not support non-European digits.  I now added a
 patch that handles non-ascii digits, so you can see what's involved.
 Note that when Unicode Consortium inevitably adds more Nd characters
 to the non-BMP planes, we will have to add surrogate pairs' support to
 this code.


It turns out that this did in fact happen:

# Newly assigned in Unicode 3.1.0 (March, 2001)
..
1D7CE..1D7FF  ; 3.1 #  [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL
MONOSPACE DIGIT NINE

See http://unicode.org/Public/UNIDATA/DerivedAge.txt

And of course,

>>> unicodedata.digit('\U0001D7CE')
0

but

>>> int('\U0001D7CE')
..
UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..

on a narrow Unicode build.  (Note the character reported in the error message!)
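The '\ud835' in that message is the high half of the UTF-16 surrogate pair that a narrow build stores for U+1D7CE. The arithmetic is the standard UTF-16 encoding and can be sketched as follows (surrogate_pair is a made-up helper name):

```python
def surrogate_pair(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = surrogate_pair(0x1D7CE)   # MATHEMATICAL BOLD DIGIT ZERO
print(hex(high), hex(low))
```

So on a narrow build the decimal codec sees two code units, neither of which is a digit, and it reports the first one.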


If you think non-ASCII digits are not difficult to support, please
contribute to the following tracker issues:

http://bugs.python.org/issue10581
(Review and document string format accepted in numeric data type constructors)

http://bugs.python.org/issue10557
(Malformed error message from float())

http://bugs.python.org/issue10435
(Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)

http://bugs.python.org/issue8646
(PyUnicode_EncodeDecimal is undocumented)

http://bugs.python.org/issue6632
(Include more fullwidth chars in the decimal codec)

and back to the issue of user confusion

http://bugs.python.org/issue652104 [closed/invalid]
(int(u"\u1234") raises UnicodeEncodeError, reported by Guido van Rossum)


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Michael Foord

On 30/11/2010 16:40, Alexander Belopolsky wrote:

[snip...]
And of course,


unicodedata.digit('\U0001D7CE')

0

but


int('\U0001D7CE')

..
UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..

on a narrow Unicode build.  (Note the character reported in the error message!)


If you think non-ASCII digits are not difficult to support, please
contribute to the following tracker issues:



Would moving this functionality to the locale module make the issues any 
easier to fix?


Michael


http://bugs.python.org/issue10581
(Review and document string format accepted in numeric data type constructors)

http://bugs.python.org/issue10557
(Malformed error message from float())

http://bugs.python.org/issue10435
(Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)

http://bugs.python.org/issue8646
(PyUnicode_EncodeDecimal is undocumented)

http://bugs.python.org/issue6632
(Include more fullwidth chars in the decimal codec)

and back to the issue of user confusion

http://bugs.python.org/issue652104 [closed/invalid]
(int(u"\u1234") raises UnicodeEncodeError, reported by Guido van Rossum)



--

http://www.voidspace.org.uk/

READ CAREFULLY. By accepting and reading this email you agree,
on behalf of your employer, to release me from all obligations
and waivers arising from any and all NON-NEGOTIATED agreements,
licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap,
confidentiality, non-disclosure, non-compete and acceptable use
policies (”BOGUS AGREEMENTS”) that I have entered into with your
employer, its partners, licensors, agents and assigns, in
perpetuity, without prejudice to my ongoing rights and privileges.
You further represent that you have the authority to release me
from any BOGUS AGREEMENTS on behalf of your employer.



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Alexander Belopolsky
On Tue, Nov 30, 2010 at 12:40 PM, Michael Foord
fuzzy...@voidspace.org.uk wrote:
..
 If you think non-ASCII digits are not difficult to support, please
 contribute to the following tracker issues:


 Would moving this functionality to the locale module make the issues any
 easier to fix?


Sure, if we code it in Python, supporting it will be much easier:

import re
import unicodedata

def normalize_digits(s):
    # \d matches any Unicode decimal digit (category Nd) on a str pattern
    digits = {m.group(1) for m in re.finditer(r'(\d)', s)}
    trtab = {ord(d): str(unicodedata.digit(d)) for d in digits}
    return s.translate(trtab)

>>> normalize_digits('١٢٣٤.٥٦')
'1234.56'

I am not sure this belongs to the locale module, however.  It seems to
me, something like 'unicodealgo' for unicode algorithms would be more
appropriate.
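A variant of the same idea without the regular expression, offered only as a sketch: it walks the string and replaces every character that has a decimal value in the database. The names to_ascii_digits and parse_float are invented for the illustration, and the code assumes a Python 3 where strings are sequences of full code points:

```python
import unicodedata


def to_ascii_digits(s):
    """Replace each Unicode decimal digit in s with its ASCII equivalent."""
    out = []
    for ch in s:
        value = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        out.append(ch if value is None else str(value))
    return "".join(out)


def parse_float(s):
    """Normalize the digits first, then let float() see only ASCII digits."""
    return float(to_ascii_digits(s))


print(to_ascii_digits("\u0661\u0662\u0663\u0664.\u0665\u0666"))
print(parse_float("\u0661\u0662\u0663\u0664.\u0665\u0666"))
print(to_ascii_digits("\U0001D7CE"))  # a non-BMP digit
```

Characters without a decimal value, such as the Han numerals, are passed through unchanged, so float() still rejects them.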


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Antoine Pitrou

 Sure, if we code it in Python, supporting it will be much easier:
 
 import re
 import unicodedata
 
 def normalize_digits(s):
     # \d matches any Unicode decimal digit (category Nd) on a str pattern
     digits = {m.group(1) for m in re.finditer(r'(\d)', s)}
     trtab = {ord(d): str(unicodedata.digit(d)) for d in digits}
     return s.translate(trtab)
 
 >>> normalize_digits('١٢٣٤.٥٦')
 '1234.56'
 
 I am not sure this belongs to the locale module, however.  It seems to
 me, something like 'unicodealgo' for unicode algorithms would be more
 appropriate.

It could simply be in unicodedata if you split the implementation into a
core C part and some Python bits.

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Alexander Belopolsky
On Tue, Nov 30, 2010 at 1:29 PM, Antoine Pitrou solip...@pitrou.net wrote:
..
 I am not sure this belongs to the locale module, however.  It seems to
 me, something like 'unicodealgo' for unicode algorithms would be more
 appropriate.

 It could simply be in unicodedata if you split the implementation into a
 core C part and some Python bits.


Splitting unicodedata may not be a bad idea.  There are many more
pieces in UCD than covered by unicodedata. [1]  Hardcoding them all
into unicodedata module is hard to justify, but some are quite useful.
 For example, PropertyValueAliases.txt is quite useful for those like
myself who cannot remember what Pd or Zl category names stand for.
SpecialCasing.txt is required for proper casing, but is not currently
included in Python.  I would not want to change str.upper or str.title
because of this, but providing the raw info to someone who wants to
implement proper case mappings may not be a bad idea.  Blocks.txt is
certainly useful for any language-dependent processing.
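To illustrate the PropertyValueAliases.txt point: unicodedata.category() returns only the two-letter abbreviation, so the long names have to come from somewhere else. The sketch below hardcodes a handful of them in a small table; CATEGORY_NAMES and describe_category are made up for the example, and the table is far from complete:

```python
import unicodedata

# A few General_Category long names as listed in PropertyValueAliases.txt.
CATEGORY_NAMES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
    "Pd": "Dash_Punctuation",
    "Zs": "Space_Separator",
    "Zl": "Line_Separator",
}


def describe_category(ch):
    """Return (abbreviation, long name) for the general category of ch."""
    abbrev = unicodedata.category(ch)
    return abbrev, CATEGORY_NAMES.get(abbrev, "(not in this small table)")


for ch in ["-", "\u2028", "\u0663", "A"]:
    print("U+%04X" % ord(ch), describe_category(ch))
```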

On the other hand, I think we should keep Unicode data and Unicode
algorithms separate.  And the latter may not even belong to the Python
stdlib.

[1] http://unicode.org/Public/UNIDATA/


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Martin v. Löwis
Am 30.11.2010 09:15, schrieb Hagen Fürstenau:
 During PEP 3003 discussion, it was suggested to handle it on a case by
 case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP
 3003.

 It's covered by "As the standard library is not directly tied to the
 language definition it is not covered by this moratorium."
 
 How is this restricted to the stdlib if it defines the set of valid
 identifiers?

The language does not change. The language specification says:

"Python 3.0 introduces additional characters from outside the ASCII range
(see PEP 3131). For these characters, the classification uses the
version of the Unicode Character Database as included in the unicodedata
module."

That remains unchanged. It was a deliberate design decision of PEP 3131
to not codify a fixed set of characters that can be used in identifiers.
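In practice this means the answer to "is this a valid identifier?" is whatever the running interpreter's database says. A short sketch using str.isidentifier(); the candidate strings are arbitrary examples chosen for the illustration:

```python
candidates = [
    "spam",
    "na\u00efve",          # Latin letter with diaeresis
    "\u4e00",              # the ideograph 一 is a letter (Lo), so it is allowed
    "x\u0661",             # a decimal digit may continue an identifier
    "\u0661\u0662\u0663",  # but may not start one
    "2fast",
]

for name in candidates:
    print(repr(name), name.isidentifier())
```

Note that decimal digits from any script are allowed inside an identifier but not at its start, which mirrors the rule for ASCII digits.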

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Martin v. Löwis
 Would moving this functionality to the locale module make the issues any
 easier to fix?

You could delegate it to the C library, so: yes.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Antoine Pitrou
Le mardi 30 novembre 2010 à 20:16 +0100, Martin v. Löwis a écrit :
  Would moving this functionality to the locale module make the issues any
  easier to fix?
 
 You could delegate it to the C library, so: yes.

I hope you don't suggest delegating it to the C locale functions.
Do you?




Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Martin v. Löwis
Am 30.11.2010 20:23, schrieb Antoine Pitrou:
 Le mardi 30 novembre 2010 à 20:16 +0100, Martin v. Löwis a écrit :
 Would moving this functionality to the locale module make the issues any
 easier to fix?

 You could delegate it to the C library, so: yes.
 
 I hope you don't suggest delegating it to the C locale functions.
 Do you?

Yes, I do. Why do you hope I don't?

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Antoine Pitrou
Le mardi 30 novembre 2010 à 20:40 +0100, Martin v. Löwis a écrit :
 Am 30.11.2010 20:23, schrieb Antoine Pitrou:
  Le mardi 30 novembre 2010 à 20:16 +0100, Martin v. Löwis a écrit :
  Would moving this functionality to the locale module make the issues any
  easier to fix?
 
  You could delegate it to the C library, so: yes.
  
  I hope you don't suggest delegating it to the C locale functions.
  Do you?
 
 Yes, I do. Why do you hope I don't?

Because we all know how locale is a pile of cr*p, both in specification
and in implementations. Our unit tests for it are a clear proof of that.

Actually, I remember you saying that locale should ideally be replaced
with a wrapper around the ICU library.

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Martin v. Löwis
 Because we all know how locale is a pile of cr*p, both in specification
 and in implementations. Our unit tests for it are a clear proof of that.

I wouldn't use expletives, but rather claim that the locale module is
highly platform-dependent.

 Actually, I remember you saying that locale should ideally be replaced
 with a wrapper around the ICU library.

By that, I stand - however, I have given up the hope that this will
happen anytime soon.

With respect to local number parsing, I think that the locale module would be
way better than the nonsense that Python currently does. In the locale
module, somebody at least has thought about what specifically
constitutes a number. The current not-ASCII-but-not-local-either
approach is just useless.

Maintaining a reasonable implementation is a burden, so deferring
to the C library is more attractive than having to maintain an
unreasonable implementation.

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Antoine Pitrou
Le mardi 30 novembre 2010 à 20:55 +0100, Martin v. Löwis a écrit :
 With respect to local number parsing, I think that the locale module would be
 way better than the nonsense that Python currently does. In the locale
 module, somebody at least has thought about what specifically
 constitutes a number. The current not-ASCII-but-not-local-either
 approach is just useless.

It depends what you need. If you parse integers it's probably good
enough. And it's better to have a trustable standard (unicode) than a
myriad of ad-hoc, possibly buggy or incomplete, often unavailable,
cultural specifications drafted by OS vendors who have no business (and
no expertise) in drafting them.

At least you can build more sophisticated routines on the simple
information given to you by the unicode database. You cannot build
anything solid on the C locale functions (and even then you are limited
by various issues inherent in the locale semantics, such as the fact
that it relies on process-wide state, which would only be ok, at best,
for single-user applications). There's a reason that e.g. Babel (*)
reimplements locale-like functionality from scratch.
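The process-wide state is visible in how a locale-aware parse has to be written: the numeric locale must be switched, used, and switched back, and any other thread calling into locale-sensitive code in between sees the change. A sketch, with atof_in_locale as a made-up helper and only the always-available "C" locale used in the demonstration:

```python
import locale


def atof_in_locale(text, locale_name):
    """Parse text with locale.atof() under a temporarily selected locale.

    This mutates process-wide state for the duration of the call, which is
    why it is unsafe in multi-threaded programs.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.atof(text)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


before = locale.setlocale(locale.LC_NUMERIC)
print(atof_in_locale("1234.56", "C"))
after = locale.setlocale(locale.LC_NUMERIC)
```

Other locale names depend on what the platform has installed, and setlocale() raises locale.Error for names it does not know, which is part of the platform dependence discussed above.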

(*) http://pypi.python.org/pypi/Babel/

Regards

Antoine.




Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Ben Finney
haiyang kang corn...@gmail.com writes:

   I think it is a little ugly to have code like this: num =
 float("一.一"), expected result is: num = 1.1

That's a straw man, though. The string need not be a literal in the
program; it can be input to the program.

num = float(input_from_the_external_world)

Does that change your assessment of whether non-ASCII digits are used?

-- 
 \“The greatest tragedy in mankind's entire history may be the |
  `\   hijacking of morality by religion.” —Arthur C. Clarke, 1991 |
_o__)  |
Ben Finney



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Terry Reedy

On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:


I see no reason not to make a similar promise for numeric literals.  I
see no good reason to allow compatibility full-width Japanese ASCII
numerals or Arabic cursive numerals in "for i in range(...)" for
example.


I do not think that anyone, at least not me, has argued for anything 
other than 0-9 digits (or 0-f for hex) in literals in program code. The 
only issue is whether non-programmer *users* should be able to use their 
native digits in applications in response to input prompts.


--
Terry Jan Reedy



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Martin v. Löwis
On 30.11.2010 21:24, Ben Finney wrote:
> haiyang kang corn...@gmail.com writes:
>
>> I think it is a little ugly to have code like this: num =
>> float("一.一"), expected result is: num = 1.1
>
> That's a straw man, though. The string need not be a literal in the
> program; it can be input to the program.
>
> num = float(input_from_the_external_world)
>
> Does that change your assessment of whether non-ASCII digits are used?

I think the OP (haiyang kang) already indicated that he finds it quite
unlikely that anybody would possibly want to enter that. You would need
a number of key strokes to enter each individual ideograph, plus you
have to press the keys for keyboard layout switching to enter the Latin
decimal separator (which you normally wouldn't use along with the Han
numerals).

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Martin v. Löwis
On 30.11.2010 23:43, Terry Reedy wrote:
> On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
>
>> I see no reason not to make a similar promise for numeric literals.  I
>> see no good reason to allow compatibility full-width Japanese ASCII
>> numerals or Arabic cursive numerals in "for i in range(...)" for
>> example.
>
> I do not think that anyone, at least not me, has argued for anything
> other than 0-9 digits (or 0-f for hex) in literals in program code. The
> only issue is whether non-programmer *users* should be able to use their
> native digits in applications in response to input prompts.

And here, my observation stands: if they wanted to, they currently
couldn't - at least not for real numbers (and also not for integers
if they want to use grouping). So the presumed application of this
feature doesn't actually work, despite the presence of the feature it
was supposedly meant to enable.
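
The gap described above can be sketched as follows. The digits are accepted, but the script's own decimal and grouping separators are not, so an application has to normalize them itself. The function name parse_arabic_indic and the choice of the Arabic separators U+066B and U+066C are assumptions made for this example.

```python
ARABIC_DECIMAL_SEPARATOR = "\u066b"
ARABIC_THOUSANDS_SEPARATOR = "\u066c"


def parse_arabic_indic(text):
    """Parse a number written with Arabic-Indic digits and separators.

    Hypothetical helper for illustration only: it drops the grouping
    separator and maps the decimal separator to an ASCII full stop,
    then leaves the digit conversion to float().
    """
    ascii_like = text.replace(ARABIC_THOUSANDS_SEPARATOR, "")
    ascii_like = ascii_like.replace(ARABIC_DECIMAL_SEPARATOR, ".")
    return float(ascii_like)


def float_accepts(text):
    """Report whether the built-in float() accepts the string unchanged."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# "1.5" written with Arabic-Indic digits and the Arabic decimal separator.
native_real = "\u0661\u066b\u0665"

# "1,000.5" with the Arabic thousands and decimal separators.
native_grouped = "\u0661\u066c\u0660\u0660\u0660\u066b\u0665"
```

Here float_accepts(native_real) is False, while parse_arabic_indic handles both strings once the separators have been rewritten.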

Regards,
Martin


Re: [Python-Dev] Python and the Unicode Character Database

2010-11-30 Thread Terry Reedy

On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:

My general answers to the questions you have raised are as follows:

1. Each new feature release should use the latest version of the UCD as 
of the first beta release (or perhaps a week or so before). New chars 
are new features and the beta period can be used to (hopefully) iron out 
any bugs introduced by a new UCD version.


2. The language specification should not be UCD version specific. Martin 
pointed out that the definition of identifiers was intentionally written 
to not be, by referring to 'current version' or some such. On the other 
hand, the UCD version used should be programmatically discoverable, 
perhaps as an attribute of sys or str.


3. The UCD should not change in bugfix releases. New chars are new 
features. Adding them in bugfix releases will introduce gratuitous 
incompatibilities between releases. People who want the latest Unicode 
should either upgrade to the latest Python version or patch an older 
version (but not expect core support for any problems that creates).
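
On the discoverability part of point 2: the standard unicodedata module already carries a version string, unicodedata.unidata_version. The sketch below turns it into a comparable tuple. The helper name ucd_version is made up for this example, and this is not the sys or str attribute suggested above.

```python
import unicodedata


def ucd_version():
    """Return the interpreter's UCD version as a tuple of ints.

    Hypothetical helper for illustration only. It assumes that
    unicodedata.unidata_version is a dotted string of integers.
    """
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


# Code that depends on characters added in Unicode 6.0 could guard on this.
HAS_UNICODE_6 = ucd_version() >= (6, 0)
```

A tuple comparison avoids the pitfalls of comparing version strings, where "10.0.0" would sort before "6.0.0".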



> Given that 2.7 will be maintained for 5 years and arguably Unicode
> Consortium takes backward compatibility very seriously, wouldn't it
> make sense to consider a backport at some point?
>
> I am sure we will soon see a bug report that the following does not
> work in 2.7: :-)
>
> >>> ord('\N{CAT FACE WITH WRY SMILE}')
> 128572


3 (cont). 2.7 is no different in that regard. It is feature frozen just 
like all other x.y releases. And that is the answer to any such report. 
If that code became valid in 2.7.2, for instance, it would still not 
work in 2.7 and 2.7.1. Not working is not a bug; working is a new 
feature introduced after 2.7 was released.
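
For code that has to run on interpreters with different UCD versions, one possible approach is to look the name up at run time instead of writing it in a literal. A \N{...} escape with an unknown name is rejected when the source is compiled, whereas unicodedata.lookup raises KeyError, which can be caught. The helper name lookup_or_none is an assumption for this sketch.

```python
import unicodedata


def lookup_or_none(name):
    """Return the character with the given Unicode name, or None.

    Hypothetical helper for illustration only. None means the UCD
    version bundled with this interpreter does not know the name.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


cat = lookup_or_none("CAT FACE WITH WRY SMILE")
missing = lookup_or_none("NO SUCH CHARACTER NAME")
```

On an interpreter whose UCD includes Unicode 6.0, cat is the character U+1F63C. On an older one it is None, and the program can fall back to something else instead of failing at import.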



>>> - How specific should library reference manual be in defining methods
>>> affected by UCD such as str.upper()?
>>
>> It should specify what this actually does in Unicode terminology
>> (probably in addition to a layman's rephrase of that)
>
> I opened an issue for this:
>
> http://bugs.python.org/issue10587


1,2 (cont). Good idea in general.


> I was more concerned about wide and narrow unicode CPython builds.  Is
> it a bug that '\U'.isalpha() may disagree even when the two
> implementations are based on the same version of UCD?



4. While the difference between narrow/wide builds of (CPython) x.y 
(which should have one constant UCD) cannot be completely masked, I 
appreciate and generally agree with your efforts to minimize them. In 
some cases, there will be a conflict/tradeoff between eliminating this 
difference versus that.
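
To illustrate where the narrow/wide difference comes from, here is a minimal sketch. A narrow build stores a character outside the Basic Multilingual Plane as two surrogate code units, so per-character methods see the halves instead of the character. The names is_wide_build and code_points are made up for this example, and U+10400 is just a sample non-BMP letter, not the character elided in the quoted message.

```python
import sys


def is_wide_build():
    """True when a str element can hold any code point up to U+10FFFF."""
    return sys.maxunicode > 0xFFFF


def code_points(s):
    """Yield the code points of s, joining surrogate pairs.

    Hypothetical helper for illustration only. On a wide build it
    normally yields ord() of each element unchanged. On a narrow build
    it recombines the two halves of each non-BMP character.
    """
    i = 0
    while i < len(s):
        cp = ord(s[i])
        if 0xD800 <= cp <= 0xDBFF and i + 1 < len(s):
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # Combine high and low surrogate into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1
        yield cp
        i += 1


# The surrogate pair that a narrow build would use to store U+10400.
pair_as_stored_on_narrow = "\ud801\udc00"
joined = list(code_points(pair_as_stored_on_narrow))
```

A predicate such as isalpha() could then be applied per code point rather than per code unit, which is one way to mask the difference. Since Python 3.3 the distinction no longer exists and is_wide_build() always returns True.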


--
Terry Jan Reedy


