Re: Unicode 7

2014-05-02 Thread Steven D'Aprano
On Thu, 01 May 2014 21:42:21 -0700, Rustom Mody wrote:


 Whats the best cure for headache?
 
 Cut off the head

o_O

I don't think so.


 Whats the best cure for Unicode?
 
 Ascii

Unicode is not a problem to be solved.

The inability to write standard human text in ASCII is a problem, e.g. 
one cannot write

“ASCII For Dummies” © 2014 by Zöe Smith, now on sale 99¢

so even *Americans* cannot represent all their common characters in 
ASCII, let alone specialised characters from mathematics, science, the 
printing industry, and law. And even Americans sometimes need to write 
text in Foreign. Where is your ASCII now?

The solution is to have at least one encoding which contains the 
additional characters needed.

The plethora of such additional encodings is a problem. The solution is a 
single encoding that covers all needed characters, like Unicode, so that 
there is no need to handle multiple encodings.

The inability for plain text files to record metadata of what encoding 
they use is a problem. The solution is to standardize on a single, world-
wide encoding, like Unicode.
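A minimal Python sketch of the point: without out-of-band metadata, bytes decoded under a wrong guess turn into mojibake, while a single agreed-upon encoding round-trips cleanly.

```python
# Sketch: "Zöe" and the cent sign are ordinary text, yet guessing the
# encoding wrongly mangles them; one universal encoding has nothing to guess.
text = "Zöe Smith © 99¢"

legacy = text.encode("latin-1")        # bytes carry no encoding metadata
garbled = legacy.decode("cp437")       # a wrong but plausible guess
assert garbled != text                 # mojibake

assert text.encode("utf-8").decode("utf-8") == text  # lossless round-trip
```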


 Saying however that there is no headache in unicode does not make the
 headache go away:
 
 http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/
 
 No I am not saying that the contents/style/tone are right. However
 people are evidently suffering the transition. Denying it is not a help.

Transitions are always more painful than after the transition has settled 
down. As I have said repeatedly, I look forward to the day when nobody 
but document archivists and academics need care about legacy encodings. 
But we're not there yet.


 And unicode consortium's ways are not exactly helpful to its own cause:
 Imagine the C standard committee deciding that adding mandatory garbage
 collection to C is a neat idea
 
 Unicode consortium's going from old BMP to current (6.0) SMPs to
 who-knows-what in the future is similar.

I don't see the connection.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Steven D'Aprano
On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:

 I dont know how one causally connects the 'headaches' but Ive seen -
 mojibake

Mojibake is certainly more common with multiple encodings, but the 
solution to that is Unicode, not ASCII.

In fact, in your blog post you even link to a post of mine where I 
explain that ASCII has gone through multiple backwards incompatible 
changes over the decades, which means you can have a limited form of 
mojibake even in pure ASCII. Between changes over various versions of 
ASCII, and ambiguous characters allowed by the standard, you needed some 
sort of out-of-band metadata to tell you whether they intended an @ or a 
`, a | or a ¬, a £ or a #, to mention only a few.

It's only since the 1980s that ASCII, actual 7-bit US ASCII, has become 
an unambiguous standard. But that's okay, because that merely allowed 
people to create dozens of 7-bit and 8-bit variations on ASCII, all 
incompatible with each other, and *call them ASCII* regardless of the 
actual standard name.

Between ambiguities in actual ASCII, and common practice to label non-
ASCII as ASCII, I can categorically say that mojibake has always been 
possible in so-called plain text. If you haven't noticed it, it was 
because you were only exchanging documents with people who happened to 
use the same set of characters as you.
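A two-line sketch of how "so-called plain text" goes wrong: the very same byte means different things in two of the many 8-bit encodings that were casually called ASCII.

```python
# Sketch: one byte, two "extended ASCII" code pages, two different
# characters -- all the ingredients mojibake needs.
data = bytes([0xA3])                # no metadata travels with the byte
western = data.decode("latin-1")    # £ POUND SIGN
russian = data.decode("koi8_r")     # ё CYRILLIC SMALL LETTER IO
assert western != russian
```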


 - unicode 'number-boxes' (what are these called?) 

They are missing character glyphs, and they have nothing to do with 
Unicode. They are due to deficiencies in the text font you are using.

Admittedly with Unicode's 1,114,112 (0x110000) possible characters (actually more, 
since a single code point can have multiple glyphs) it isn't surprising 
that most font designers have neither the time, skill or desire to create 
a glyph for every single code point. But then the same applies even for 
more restrictive 8-bit encodings -- sometimes font designers don't even 
bother providing glyphs for *ASCII* characters.

(E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

 - Worst of all what we
 *dont* see -- how many others dont see what we see?

Again, this is a deficiency of the font. There are very few code points in 
Unicode which are intended to be invisible, e.g. space, newline, zero-
width joiner, control characters, etc., but they ought to be equally 
invisible to everyone. No printable character should ever be invisible in 
any decent font.
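The distinction being drawn is queryable from Python's unicodedata module; a small sketch (the category codes are from the Unicode standard):

```python
import unicodedata

# Sketch: the handful of characters *meant* to render blank or invisible
# sit in the separator (Z*) and control/format (Cc/Cf) categories.
def intentionally_blank(ch):
    return unicodedata.category(ch) in {"Zs", "Zl", "Zp", "Cc", "Cf"}

assert intentionally_blank(" ")           # SPACE
assert intentionally_blank("\n")          # a control character
assert intentionally_blank("\u200d")      # ZERO WIDTH JOINER
assert not intentionally_blank("\u00b6")  # the pilcrow should be visible
```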


 I never knew of any of this in the good ol days of ASCII

You must have been happy with a very impoverished set of symbols, then.


 ¶ Passive voice is often the best choice in the interests of political
 correctness
 
 It would be a pleasant surprise if everyone sees a pilcrow at start of
 line above

I do.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Chris Angelico
On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 ... even *Americans* cannot represent all their common characters in
 ASCII, let alone specialised characters from mathematics, science, the
 printing industry, and law.

Aside: What additional characters does law use that aren't in ASCII?
Section § and paragraph ¶ are used frequently, but you already
mentioned the printing industry. Are there other symbols?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Chris Angelico
On Fri, May 2, 2014 at 6:45 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 - unicode 'number-boxes' (what are these called?)

 They are missing character glyphs, and they have nothing to do with
 Unicode. They are due to deficiencies in the text font you are using.

 Admittedly with Unicode's 1,114,112 (0x110000) possible characters (actually more,
 since a single code point can have multiple glyphs) it isn't surprising
 that most font designers have neither the time, skill or desire to create
 a glyph for every single code point. But then the same applies even for
 more restrictive 8-bit encodings -- sometimes font designers don't even
 bother providing glyphs for *ASCII* characters.

 (E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)

This is another area where Unicode has given us a great improvement
over the old method of giving satisfaction. Back in the 1990s on
OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a
simple square with no information, or (c) copied from some other font
(common with dingbats fonts). With Unicode, the standard is to show a
little box *with the hex digits in it*. Granted, those boxes are a LOT
more readable for BMP characters than SMP (unless your text is huge,
six digits in the space of one character will make them pretty tiny),
and a Unicode font will generally include all (or at least most) of
the BMP, but it's still better than having no information at all.
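The hex digits in such a fallback box are simply the code point, which is easy to compute; a sketch (the helper name is made up):

```python
# Sketch: the digits a "hexbox" fallback glyph shows are the code point
# in hex -- four digits for the BMP, five or six beyond it.
def hexbox_label(ch):
    return f"{ord(ch):04X}"

assert hexbox_label("A") == "0041"            # BMP: four digits fit
assert hexbox_label("\U0001D11E") == "1D11E"  # SMP (G clef): five digits
```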

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Ben Finney
Chris Angelico ros...@gmail.com writes:

 On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
  ... even *Americans* cannot represent all their common characters in
  ASCII, let alone specialised characters from mathematics, science,
  the printing industry, and law.

 Aside: What additional characters does law use that aren't in ASCII?
 Section § and paragraph ¶ are used frequently, but you already
 mentioned the printing industry. Are there other symbols?

ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE
REGISTERED SIGN), for instance.

-- 
 \ “I got some new underwear the other day. Well, new to me.” —Emo |
  `\   Philips |
_o__)  |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Chris Angelico
On Fri, May 2, 2014 at 7:16 PM, Ben Finney b...@benfinney.id.au wrote:
 Chris Angelico ros...@gmail.com writes:

 On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
  ... even *Americans* cannot represent all their common characters in
  ASCII, let alone specialised characters from mathematics, science,
  the printing industry, and law.

 Aside: What additional characters does law use that aren't in ASCII?
 Section § and paragraph ¶ are used frequently, but you already
 mentioned the printing industry. Are there other symbols?

 ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE
 REGISTERED SIGN), for instance.

Heh! I forgot about those. U+00A9 in particular has gone so mainstream
that it's easy to think of it not as "I'm going to switch to my
'British English + Legal' dictionary now" and just as "This is a
critical part of the basic dictionary".

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Jussi Piitulainen
Chris Angelico writes:

 (common with dingbats fonts). With Unicode, the standard is to show
 a little box *with the hex digits in it*. Granted, those boxes are a
 LOT more readable for BMP characters than SMP (unless your text is
 huge, six digits in the space of one character will make them pretty
 tiny), and a Unicode font will generally include all (or at least
 most) of the BMP, but it's still better than having no information

I needed to see such tiny numbers just today, just the four of them in
the tiny box. So I pressed C-+ a few times to _make_ the text huge,
obtained my information, and returned to my normal text size with C--.

Perfect. Usually all I need to know is that I have a character for
which I don't have a glyph, but this time I wanted to record the
number because I was testing things rather than reading the text.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Marko Rauhamaa
Ben Finney b...@benfinney.id.au:

 Aside: What additional characters does law use that aren't in ASCII?
 Section § and paragraph ¶ are used frequently, but you already
 mentioned the printing industry. Are there other symbols?

 ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE
 REGISTERED SIGN), for instance.

The em-dash is mapped on my keyboard — I use it quite often.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Rustom Mody
On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote:
 On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:
  - Worst of all what we
  *dont* see -- how many others dont see what we see?

 Again, this is a deficiency of the font. There are very few code points in 
 Unicode which are intended to be invisible, e.g. space, newline, zero-
 width joiner, control characters, etc., but they ought to be equally 
 invisible to everyone. No printable character should ever be invisible in 
 any decent font.

Thats not what I meant.

I wrote http://blog.languager.org/2014/04/unicoded-python.html
 – mostly on a debian box.
Later on seeing it on a less heavily setup ubuntu box, I see
 ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
have become 'missing-glyph' boxes.

It leads me to ask, how much else of what I am writing, some random reader 
has simply not seen?
Quite simply we can never know – because most are going to go away saying
"mojibaked/garbled rubbish".

Speaking of what you understood of what I said:
Yes invisible chars is another problem I was recently bitten by.
I pasted something from google into emacs' org mode.
Following that link again I kept getting a broken link.

Until I found that the link had an invisible char

The problem was that emacs was faithfully rendering that char according
to standard, i.e. invisibly!
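A sketch of the kind of audit that would have exposed the character (the URL and helper name are made up for illustration):

```python
import unicodedata

# Sketch: name every character in a string that the eye cannot see.
def reveal_invisibles(s):
    return [(i, f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
            for i, c in enumerate(s)
            if unicodedata.category(c) in {"Cf", "Cc"}]

link = "http://example.com/page\u200b"   # ZERO WIDTH SPACE pasted along
assert reveal_invisibles(link) == [(23, "U+200B", "ZERO WIDTH SPACE")]
```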
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Steven D'Aprano
On Fri, 02 May 2014 19:01:44 +1000, Chris Angelico wrote:

 On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 ... even *Americans* cannot represent all their common characters in
 ASCII, let alone specialised characters from mathematics, science, the
 printing industry, and law.
 
 Aside: What additional characters does law use that aren't in ASCII?
 Section § and paragraph ¶ are used frequently, but you already mentioned
 the printing industry. Are there other symbols?

I was thinking of copyright, trademark, registered mark, and similar. I 
think these are all of relevant characters:

py> import unicodedata
py> for c in '©®℗™':
...     unicodedata.name(c)
...
'COPYRIGHT SIGN'
'REGISTERED SIGN'
'SOUND RECORDING COPYRIGHT'
'TRADE MARK SIGN'



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Steven D'Aprano
On Fri, 02 May 2014 03:39:34 -0700, Rustom Mody wrote:

 On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote:
 On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:
  - Worst of all what we
  *dont* see -- how many others dont see what we see?
 
 Again, this is a deficiency of the font. There are very few code points in
 Unicode which are intended to be invisible, e.g. space, newline, zero-
 width joiner, control characters, etc., but they ought to be equally
 invisible to everyone. No printable character should ever be invisible
 in any decent font.
 
 Thats not what I meant.
 
 I wrote http://blog.languager.org/2014/04/unicoded-python.html
  – mostly on a debian box.
 Later on seeing it on a less heavily setup ubuntu box, I see
  ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
 have become 'missing-glyph' boxes.
 
 It leads me to ask, how much else of what I am writing, some random reader
 has simply not seen?
 Quite simply we can never know – because most are going to go away
 saying mojibaked/garbled rubbish
 
 Speaking of what you understood of what I said: Yes invisible chars is
 another problem I was recently bitten by. I pasted something from google
 into emacs' org mode. Following that link again I kept getting a broken
 link.
 
 Until I found that the link had an invisible char
 
 The problem was that emacs was faithfully rendering that char according
 to standard, ie invisibly!

And you've never been bitten by an invisible control character in ASCII 
text? You've lived a sheltered life!

Nothing you are describing is unique to Unicode.


-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 And you've never been bitten by an invisible control character in
 ASCII text? You've lived a sheltered life!

That reminds me: " " (U+00A0, the non-breaking space) is often used
between numbers and units, for example.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Tim Chase
On 2014-05-02 19:08, Chris Angelico wrote:
 This is another area where Unicode has given us a great improvement
 over the old method of giving satisfaction. Back in the 1990s on
 OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a
 simple square with no information, or (c) copied from some other
 font (common with dingbats fonts). With Unicode, the standard is to
 show a little box *with the hex digits in it*. Granted, those boxes
 are a LOT more readable for BMP characters than SMP (unless your
 text is huge, six digits in the space of one character will make
 them pretty tiny), and a Unicode font will generally include all
 (or at least most) of the BMP, but it's still better than having no
 information at all.

I'm pleased when applications & fonts work properly, using both the
placeholder ("this character is legitimate but I can't display it with
a font, so here, have a box with the codepoint numbers in it until I'm
directed to use a more appropriate font, at which point you'll see it
correctly") and the "somebody crammed garbage in here, so I'll display
it with � (U+FFFD), which is designated for exactly this purpose".
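Both cases are easy to reproduce in Python; a sketch:

```python
# Sketch: valid text survives decoding even if the font has no glyph for
# it; actual garbage is marked with U+FFFD on decode.
legit = "\U0001F40D".encode("utf-8")     # a valid (if exotic) character
assert legit.decode("utf-8") == "\U0001F40D"

garbage = b"\xff\xfe!"                   # not valid UTF-8
repaired = garbage.decode("utf-8", errors="replace")
assert repaired == "\ufffd\ufffd!"       # each bad byte becomes U+FFFD
```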

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Rustom Mody
On Friday, May 2, 2014 5:25:37 PM UTC+5:30, Steven D'Aprano wrote:
 On Fri, 02 May 2014 03:39:34 -0700, Rustom Mody wrote:

  On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote:
  On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:
   - Worst of all what we
   *dont* see -- how many others dont see what we see?
  Again, this is a deficiency of the font. There are very few code points in
  Unicode which are intended to be invisible, e.g. space, newline, zero-
  width joiner, control characters, etc., but they ought to be equally
  invisible to everyone. No printable character should ever be invisible
  in any decent font.
  Thats not what I meant.
  I wrote http://blog.languager.org/2014/04/unicoded-python.html
   – mostly on a debian box.
  Later on seeing it on a less heavily setup ubuntu box, I see
   ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
  have become 'missing-glyph' boxes.
  It leads me to ask, how much else of what I am writing, some random reader
  has simply not seen?
  Quite simply we can never know – because most are going to go away
  saying mojibaked/garbled rubbish
  Speaking of what you understood of what I said: Yes invisible chars is
  another problem I was recently bitten by. I pasted something from google
  into emacs' org mode. Following that link again I kept getting a broken
  link.
  Until I found that the link had an invisible char
  The problem was that emacs was faithfully rendering that char according
  to standard, ie invisibly!

 And you've never been bitten by an invisible control character in ASCII 
 text? You've lived a sheltered life!

For control characters Ive seen:
- garbage (the ASCII equiv of mojibake)
- Straight ^A^B^C
- Maybe their names NUL,SOH,STX,ETX,EOT,ENQ,ACK…
- Or maybe just a little dot .
- More pathological behavior: a control sequence putting the
  terminal into some other mode

But I dont ever remember seeing a control character become
invisible (except [ \t\n\f])
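Python's own repr() is one more entry for that list of renderings: it makes control characters visible as backslash escapes. A sketch:

```python
# Sketch: repr() never renders ASCII control characters invisibly;
# they come out as escapes instead.
s = "\x01\x02\x03"
assert repr(s) == "'\\x01\\x02\\x03'"   # generic controls: hex escapes
assert repr("\t\n") == "'\\t\\n'"       # whitespace controls: named escapes
```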
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread MRAB

On 2014-05-02 03:39, Ben Finney wrote:

Rustom Mody rustompm...@gmail.com writes:


Yes, the headaches go a little further back than Unicode.


Okay, so can you change your article to reflect the fact that the
headaches both pre-date Unicode, and are made much easier by Unicode?


There is a certain large old book...


Ah yes, the neo-Sumerian story “Enmerkar_and_the_Lord_of_Aratta”
URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta.
Probably inspired by stories older than that, of course.


In which is described the building of a 'tower that reached up to heaven'...
At which point 'it was decided'¶ to do something to prevent that.
And our headaches started.


And other myths with fantastic reasons for the diversity of language
URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language.


I never knew of any of this in the good ol days of ASCII


Yes, by ignoring all other writing systems except one's own – and
thereby excluding most of the world's people – the system can be made
simpler.


ASCII lacked even £. I can remember assembly listings in magazines
containing lines such as:

LDA £0

I even (vaguely) remember an advert with a character that looked like
Ł, presumably because they didn't have £. In a UK magazine? Very
strange!


Hopefully the proportion of programmers who still feel they can make
such a parochial choice is rapidly shrinking.



--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Rustom Mody
On Friday, May 2, 2014 5:25:37 PM UTC+5:30, Steven D'Aprano wrote:
 On Fri, 02 May 2014 03:39:34 -0700, Rustom Mody wrote:

  On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote:
  On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:
   - Worst of all what we
   *dont* see -- how many others dont see what we see?
  Again, this is a deficiency of the font. There are very few code points in
  Unicode which are intended to be invisible, e.g. space, newline, zero-
  width joiner, control characters, etc., but they ought to be equally
  invisible to everyone. No printable character should ever be invisible
  in any decent font.
  Thats not what I meant.
  I wrote http://blog.languager.org/2014/04/unicoded-python.html
   – mostly on a debian box.
  Later on seeing it on a less heavily setup ubuntu box, I see
   ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊
  have become 'missing-glyph' boxes.
  It leads me to ask, how much else of what I am writing, some random reader
  has simply not seen?
  Quite simply we can never know – because most are going to go away
  saying mojibaked/garbled rubbish
  Speaking of what you understood of what I said: Yes invisible chars is
  another problem I was recently bitten by. I pasted something from google
  into emacs' org mode. Following that link again I kept getting a broken
  link.
  Until I found that the link had an invisible char
  The problem was that emacs was faithfully rendering that char according
  to standard, ie invisibly!

 And you've never been bitten by an invisible control character in ASCII 
 text? You've lived a sheltered life!

 Nothing you are describing is unique to Unicode.

Just noticed a small thing in which python does a bit better than haskell:
$ ghci
Prelude> let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
Prelude>

In case it's not apparent, the "fi" in the first "ﬁne" is a ligature (U+FB01).

Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
    ^
SyntaxError: invalid syntax
>>>

The point of that example is to show that unicode gives all kind of 
Aaah! Gotcha!! opportunities that just dont exist in the old world.
Python may have got this one right but there are surely dozens of others.

On the other hand I see more eagerness for unicode source-text there
eg.

https://github.com/i-tu/Hasklig
http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#unicode-syntax
http://www.haskell.org/haskellwiki/Unicode-symbols
http://hackage.haskell.org/package/base-unicode-symbols

Some music 𝄞 𝄢 ♭ 𝄱 to appease the utf-8 gods 



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread MRAB

On 2014-05-02 09:08, Steven D'Aprano wrote:

On Thu, 01 May 2014 21:42:21 -0700, Rustom Mody wrote:



Whats the best cure for headache?

Cut off the head


o_O

I don't think so.



Whats the best cure for Unicode?

Ascii


Unicode is not a problem to be solved.

The inability to write standard human text in ASCII is a problem, e.g.
one cannot write

“ASCII For Dummies” © 2014 by Zöe Smith, now on sale 99¢


[snip]

Shouldn't that be Zoë?

--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Michael Torrie
On 05/02/2014 10:50 AM, Rustom Mody wrote:
 Python just barfs:
 
 >>> ﬁne = 1
   File "<stdin>", line 1
     ﬁne = 1
     ^
 SyntaxError: invalid syntax

 
 The point of that example is to show that unicode gives all kind of 
 Aaah! Gotcha!! opportunities that just dont exist in the old world.
 Python may have got this one right but there are surely dozens of others.

Except that it doesn't.  This has nothing to do with unicode handling.
It has everything to do with what defines an identifier in Python.  This
is no different than someone wondering why they can't start an
identifier in Python 1.x with a number or punctuation mark.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Ned Batchelder

On 5/2/14 12:50 PM, Rustom Mody wrote:

Just noticed a small thing in which python does a bit better than haskell:
$ ghci
let (fine, fine) = (1,2)
Prelude (fine, fine)
(1,2)
Prelude

In case it's not apparent, the "fi" in the first "ﬁne" is a ligature.

Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
    ^
SyntaxError: invalid syntax




Surely by now we could at least be explicit about which version of 
Python we are talking about?


  $ python2.7
  Python 2.7.2 (default, Oct 11 2012, 20:14:37)
  [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> ﬁne = 1
    File "<stdin>", line 1
      ﬁne = 1
      ^
  SyntaxError: invalid syntax
  >>> ^D
  $ python3.4
  Python 3.4.0b1 (default, Dec 16 2013, 21:05:22)
  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> ﬁne = 1
  >>> fine
  1

In Python 2 identifiers must be ASCII.  Python 3 allows many Unicode 
characters in identifiers (see PEP 3131 for details: 
http://legacy.python.org/dev/peps/pep-3131/)
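The Python 3 rule can be probed from the language itself; a sketch using str.isidentifier() (the ligature example mirrors the one in this thread):

```python
# Sketch: in Python 3, identifier validity is a property of the string
# itself, queryable with str.isidentifier().
assert "fine".isidentifier()
assert "\ufb01ne".isidentifier()     # "fine" with the fi ligature: accepted
assert not "1fine".isidentifier()    # may not start with a digit
assert not "no-dash".isidentifier()  # punctuation is still out
```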


--
Ned Batchelder, http://nedbatchelder.com

--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Peter Otten
Rustom Mody wrote:

 Just noticed a small thing in which python does a bit better than haskell:
 $ ghci
 Prelude> let (ﬁne, fine) = (1,2)
 Prelude> (ﬁne, fine)
 (1,2)
 Prelude>
 
 In case it's not apparent, the "fi" in the first "ﬁne" is a ligature.
 
 Python just barfs:

Not Python 3:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16) 
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)

No copy-and-paste errors involved:

>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2
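A sketch of why Peter's session binds a single name: per PEP 3131 the tokenizer normalizes identifiers to NFKC, and NFKC folds the fi ligature to plain "fi".

```python
import unicodedata

# Sketch: NFKC folds compatibility characters, so the ligature spelling
# and the plain spelling normalize to the same identifier.
assert unicodedata.normalize("NFKC", "\ufb01ne") == "fine"
```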


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Ben Finney
Marko Rauhamaa ma...@pacujo.net writes:

 That reminds me:   [U+00A0 NON-BREAKING SPACE] is often used between
 numbers and units, for example.

The non-breaking space (“ ” U+00A0) is frequently used in text to keep
conceptually inseparable text such as “100 km” from automatic word
breaks URL:https://en.wikipedia.org/wiki/Non-breaking_space.

Because of established, conflicting conventions for separating groups of
digits (“1,234.00” in many countries; “1.234,00” in many others)
URL:https://en.wikipedia.org/wiki/Thousands_separator#Digit_grouping,
the “ ” U+2009 THIN SPACE URL:https://en.wikipedia.org/wiki/Thin_Space
is recommended for separating digit groups (e.g. “1 234 567 m”)
URL:https://en.wikipedia.org/wiki/SI_units#General_rules.
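A sketch of producing such SI-style spacing in Python (si_format is a made-up helper name; the grouping rule is the one described above):

```python
# Sketch: U+2009 THIN SPACE between digit groups, U+00A0 NO-BREAK SPACE
# before the unit.
def si_format(n, unit):
    grouped = f"{n:,}".replace(",", "\u2009")
    return f"{grouped}\u00a0{unit}"

assert si_format(1234567, "m") == "1\u2009234\u2009567\u00a0m"
```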

-- 
 \   “We spend the first twelve months of our children's lives |
  `\  teaching them to walk and talk and the next twelve years |
_o__)   telling them to sit down and shut up.” —Phyllis Diller |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Roy Smith
In article mailman.9659.1399064866.18130.python-l...@python.org,
 Ben Finney b...@benfinney.id.au wrote:

 The non-breaking space (“ ” U+00A0) is frequently used in text to keep
 conceptually inseparable text such as “100 km” from automatic word
 breaks URL:https://en.wikipedia.org/wiki/Non-breaking_space.

Which, by the way, argparse doesn't honor...

http://bugs.python.org/issue16623
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Rustom Mody
On Friday, May 2, 2014 11:37:02 PM UTC+5:30, Peter Otten wrote:
 Rustom Mody wrote:

  Just noticed a small thing in which python does a bit better than haskell:
  $ ghci
  Prelude> let (ﬁne, fine) = (1,2)
  Prelude> (ﬁne, fine)
  (1,2)
  In case it's not apparent, the "fi" in the first "ﬁne" is a ligature.
  Python just barfs:

 Not Python 3:

 Python 3.3.2+ (default, Feb 28 2014, 00:52:16) 
 [GCC 4.8.1] on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> (ﬁne, fine) = (1,2)
 >>> (ﬁne, fine)
 (2, 2)

 No copy-and-paste errors involved:

 >>> eval("\ufb01ne")
 2
 >>> eval(b"fine".decode("ascii"))
 2

Aah! Thanks Peter (and Ned and Michael) — 2-3 confusion — my bad.

I am confused about the tone however:
You think this

 (ﬁne, fine) = (1,2) # and no issue about it

is fine?


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Chris Angelico
On Sat, May 3, 2014 at 10:58 AM, Rustom Mody rustompm...@gmail.com wrote:
 You think this

 (ﬁne, fine) = (1,2) # and no issue about it

 is fine?

Not sure which part you're objecting to. Are you saying that this
should be an error:

 a, a = 1, 2 # simple ASCII identifier used twice

or that Python should take the exact sequence of codepoints, rather
than normalizing?

Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ﬁne = 1
>>> vars()
{'__package__': None, '__spec__': None, '__doc__': None, 'fine': 1,
'__loader__': <class '_frozen_importlib.BuiltinImporter'>,
'__builtins__': <module 'builtins' (built-in)>, '__name__':
'__main__'}

As regards normalization, I would be happy with either "keep it
exactly as you provided" or "normalize according to <insert Unicode
standard normalization here>", as long as it's consistent. It's like
what happens with SQL identifiers: according to the standard, an
unquoted name should be uppercased, but some databases instead
lowercase them. It doesn't break code (modulo quoted names, not
applicable here), as long as it's consistent.

(My reading of PEP 3131 is that NFKC is used; is that what's
implemented, or was that a temporary measure and/or something for Py2
to consider?)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Ned Batchelder

On 5/2/14 8:58 PM, Rustom Mody wrote:

On Friday, May 2, 2014 11:37:02 PM UTC+5:30, Peter Otten wrote:

Rustom Mody wrote:



Just noticed a small thing in which python does a bit better than haskell:
$ ghci
Prelude> let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
In case it's not apparent, the "fi" in the first "ﬁne" is a ligature.
Python just barfs:



Not Python 3:



Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)

No copy-and-paste errors involved:

>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2


Aah! Thanks Peter (and Ned and Michael) — 2-3 confusion — my bad.

I am confused about the tone however:
You think this


(ﬁne, fine) = (1,2) # and no issue about it


is fine?




Can you be more explicit?  It seems like you think it isn't fine.  Why 
not?  What bothers you about it?  Should there be an issue?


--
Ned Batchelder, http://nedbatchelder.com

--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Rustom Mody
On Saturday, May 3, 2014 6:48:21 AM UTC+5:30, Ned Batchelder wrote:
 On 5/2/14 8:58 PM, Rustom Mody wrote:
  On Friday, May 2, 2014 11:37:02 PM UTC+5:30, Peter Otten wrote:
  Rustom Mody wrote:
  Just noticed a small thing in which python does a bit better than haskell:
  $ ghci
  Prelude> let (ﬁne, fine) = (1,2)
  Prelude> (ﬁne, fine)
  (1,2)
  In case it's not apparent, the "fi" in the first "ﬁne" is a ligature.
  Python just barfs:
  Not Python 3:
  Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
  [GCC 4.8.1] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> (ﬁne, fine) = (1,2)
  >>> (ﬁne, fine)
  (2, 2)
  No copy-and-paste errors involved:
  >>> eval("\ufb01ne")
  2
  >>> eval(b"fine".decode("ascii"))
  2
  Aah! Thanks Peter (and Ned and Michael) — 2-3 confusion — my bad.
  I am confused about the tone however:
  You think this
  (ﬁne, fine) = (1,2) # and no issue about it
  is fine?

 Can you be more explicit?  It seems like you think it isn't fine.  Why 
 not?  What bothers you about it?  Should there be an issue?

Two identifiers that to some programmers
- can look the same
- and not to others
- and that the language treats as different

is not fine (or ﬁne) to me.

Putting them together as I did is summarizing the problem.

Think of them textually widely separated.
And the code (un)serendipitously 'working' (ie not giving NameErrors)


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Chris Angelico
On Sat, May 3, 2014 at 11:42 AM, Rustom Mody rustompm...@gmail.com wrote:
 Two identifiers that to some programmers
 - can look the same
 - and not to others
 - and that the language treats as different

 is not fine (or fine) to me.

The language treats them as the same, though.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Steven D'Aprano
On Fri, 02 May 2014 17:58:51 -0700, Rustom Mody wrote:

 I am confused about the tone however: You think this
 
 (fine, fine) = (1,2) # and no issue about it
 
 is fine?


It's no worse than any other obfuscated variable name:

MOOSE, MO0SE, M0OSE = 1, 2, 3
xl, x1 = 1, 2

If you know your victim is reading source code in Ariel font, "rn" and 
"m" are virtually indistinguishable except at very large sizes.
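For what it's worth, the interpreter really does keep such look-alike names separate; a small sketch using the names from the example above:

```python
# MOOSE vs MO0SE vs M0OSE: letter O and digit 0 are different characters,
# so these are three genuinely distinct names to the interpreter
MOOSE, MO0SE, M0OSE = 1, 2, 3
assert (MOOSE, MO0SE, M0OSE) == (1, 2, 3)

# likewise letter l vs digit 1
xl, x1 = 1, 2
assert xl != x1
```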



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Rustom Mody
On Saturday, May 3, 2014 7:24:08 AM UTC+5:30, Chris Angelico wrote:
 On Sat, May 3, 2014 at 11:42 AM, Rustom Mody wrote:
  Two identifiers that to some programmers
  - can look the same
  - and not to others
  - and that the language treats as different
  is not fine (or fine) to me.

 The language treats them as the same, though.

Whoops! I seem to be goofing a lot today

Saw Peter's

 (fine, fine) = (1,2) 

Didn't notice his next line
 (fine, fine)
(2, 2) 

So then I am back to my original point:

Python is giving better behavior than Haskell in this regard!

[Earlier reached this conclusion via a wrong path]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Steven D'Aprano
On Sat, 03 May 2014 02:02:32 +, Steven D'Aprano wrote:

 On Fri, 02 May 2014 17:58:51 -0700, Rustom Mody wrote:
 
 I am confused about the tone however: You think this
 
 (fine, fine) = (1,2) # and no issue about it
 
 is fine?
 
 
 It's no worse than any other obfuscated variable name:
 
 MOOSE, MO0SE, M0OSE = 1, 2, 3
 xl, x1 = 1, 2
 
 If you know your victim is reading source code in Ariel font, "rn" and
 "m" are virtually indistinguishable except at very large sizes.


Ooops! I too missed that Python normalises the name ﬁne to fine, so in 
fact this is not a case of obfuscation. 



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Chris Angelico
On Sat, May 3, 2014 at 12:02 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 If you know your victim is reading source code in Ariel font, "rn" and
 "m" are virtually indistinguishable except at very large sizes.

I kinda like the idea of naming it after a bratty teenager who rebels
against her father and runs away from home, but normally the font's
called Arial. :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-02 Thread Terry Reedy

On 5/2/2014 9:15 PM, Chris Angelico wrote:


(My reading of PEP 3131 is that NFKC is used; is that what's
implemented, or was that a temporary measure and/or something for Py2
to consider?)


The 3.4 docs say "The syntax of identifiers in Python is based on the 
Unicode standard annex UAX-31, with elaboration and changes as defined 
below; see also PEP 3131 for further details."

...
"All identifiers are converted into the normal form NFKC while parsing; 
comparison of identifiers is based on NFKC."


Without reading UAX-31, I don't know how much was changed, but I suspect 
not much. In any case, the current rules are intended and very unlikely 
to change as that would break code going either forward or back for 
little purpose.
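The normalization the docs describe can be observed directly; a small sketch (the throwaway namespace and names are mine):

```python
import unicodedata

# NFKC folds the ligature U+FB01 into the two letters 'fi'
assert unicodedata.normalize('NFKC', '\ufb01ne') == 'fine'

# the parser applies the same normalization to identifiers, so both
# spellings below bind the single name 'fine'
ns = {}
exec('\ufb01ne = 1\nfine = 2', ns)
assert ns['fine'] == 2
assert '\ufb01ne' not in ns
```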


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread wxjmfauth
Le mercredi 30 avril 2014 20:48:48 UTC+2, Tim Chase a écrit :
 On 2014-04-30 00:06, wxjmfa...@gmail.com wrote:
 
  @ Time Chase
 
  
 
  I'm perfectly aware about what I'm doing.
 
 
 
 Apparently, you're quite adept at appending superfluous characters to
 
 sensible strings...did you benchmark your email composition, too? ;-)
 
 
 
 -tkc (aka Tim, not Time)

Mea culpa, ...

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Thursday, May 1, 2014 10:30:43 AM UTC+5:30, Steven D'Aprano wrote:
 On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:

  On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
  While I dislike feeding the troll, what I see here is:
  Since its Unicode-troll time, here's my contribution
  http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

 Also your link to Joel On Software mistakenly links to me instead of Joel.
 There's a missing apostrophe in Ive [sic] in Acknowledgment #2.

Done, Done.

 I didn't notice any other typos.

Thank you sir!

 I point out that out of the two most widespread flavours of OS today, 
 Linux/Unix and Windows, it is *Windows* and not Unix which still 
 regularly uses legacy encodings.

Not sure what you are suggesting... 
That (I am suggesting that) 8859 is legacy and 1252 is not?

 I disagree with much of your characterisation of the Unix assumption,

I'd be interested to know the details -- Contents? Details? Tone? Tenor? 
Blaspheming the sacred scripture?
(if you are so inclined of course)


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Terry Reedy

On 5/1/2014 2:04 PM, Rustom Mody wrote:


Since its Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


I will not comment on the Unix-assumption part, but I think you go wrong 
with this: "Unicode is a Headache". The major headache is that unicode 
and its very few encodings are not universally used. The headache is all 
the non-unicode legacy encodings still being used. So you better title 
this section 'Non-Unicode is a Headache'.


The first sentence is this misleading tautology: "With ASCII, data is 
ASCII whether its file, core, terminal, or network; ie ABC is 
65,66,67." Let me translate: "If all text is ASCII encoded, then text 
data is ASCII, whether ..." But it was never the case that all text was 
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
still uses the latter. Other mainframe makers used other encodings of 
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
universal. You could have just as well said "With EBCDIC, data is 
EBCDIC, whether ..."


https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC

A crucial step in the spread of Ascii was its use for microcomputers, 
including the IBM PC. The latter was considered a toy by the mainframe 
guys. If they had known that PCs would partly take over the computing 
world, they might have suggested or insisted that it use EBCDIC.


With unicode there are:
encodings
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not 
universal, all of the problems with *non-unicode* character sets and 
encodings would disappear. The pre-unicode declarations could then 
disappear. More truthful: without unicode there are 100s of encodings 
and with unicode only 3 that we should worry about.
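The point that those three encodings all cover the same repertoire can be checked directly; a small sketch (the sample string is mine):

```python
# any of the three Unicode encodings round-trips arbitrary text losslessly
s = 'Zöe ©2014 99¢ \u0fce'
for enc in ('utf-8', 'utf-16', 'utf-32'):
    assert s.encode(enc).decode(enc) == s
```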


in-memory formats

These are not the concern of the using programmer as long as they do not 
introduce bugs or limitations (as do all the languages stuck on UCS-2 
and many using UTF-16, including old Python narrow builds). Using what 
should generally be the universal transmission format, UTF-8, as the 
internal format means either losing indexing and slicing, having those 
operations slow from O(1) to O(len(string)), or adding an index table 
that is not part of the unicode standard. Using UTF-32 avoids the above 
but usually wastes space -- up to 75%.


strange beasties like python's FSR

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
is an *internal optimization* that benefits most unicode operations that 
people actually perform. It uses UTF-32 by default but adapts to the 
strings users create by compressing the internal format. The compression 
is trivial -- simply dropping leading null bytes common to all 
characters -- so each character is still readable as is. The string 
header records how many bytes are left.  Is the idea of algorithms that 
adapt to inputs really strange to you?
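The adaptive storage is visible from Python itself; a rough sketch (exact sizes vary across CPython versions, so only the ordering is asserted, and the sample strings are mine):

```python
import sys

# PEP 393: a str is stored at 1, 2, or 4 bytes per character, chosen by
# the widest character it contains
ascii_s  = 'a' * 100                   # pure ASCII: 1 byte/char
bmp_s    = 'a' * 99 + '\u0fce'         # one BMP char widens it: 2 bytes/char
astral_s = 'a' * 99 + '\U0001F600'     # one astral char: 4 bytes/char

assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```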


Like good adaptive algorithms, the FSR is invisible to the user except 
for reducing space or time or maybe both. Unicode operations are 
otherwise the same as with previous wide builds. People who used to use 
narrow-builds also benefit from bug elimination. The only 'headaches' 
involved might have been those of the developers who optimized previous 
wide builds.


CPython has many other functions with special-case optimizations and 
'fast paths' for common, simple cases. For instance, (some? all?) number 
operations are optimized for pairs of integers.  Do you call these 
'strange beasties'?


PyPy is faster than CPython, when it is, because it is even more 
adaptable to particular computations by creating new fast paths. The 
mechanism to create these 'strange beasties' might have been a headache 
for the writers, but when it works, which it now seems to, it is not for 
the users.


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread MRAB

On 2014-05-01 23:38, Terry Reedy wrote:

On 5/1/2014 2:04 PM, Rustom Mody wrote:


Since its Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


I will not comment on the Unix-assumption part, but I think you go wrong
with this:  Unicode is a Headache. The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you better title
this section 'Non-Unicode is a Headache'.


[snip]
I think he's right when he says Unicode is a headache, but only
because it's being used to handle languages which are, themselves, a
headache: left-to-right versus right-to-left, sometimes on the same
line; diacritics, possibly several on a glyph; etc.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Friday, May 2, 2014 5:03:21 AM UTC+5:30, MRAB wrote:
 On 2014-05-01 23:38, Terry Reedy wrote:
  On 5/1/2014 2:04 PM, Rustom Mody wrote:
  Since its Unicode-troll time, here's my contribution
  http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
  I will not comment on the Unix-assumption part, but I think you go wrong
  with this:  Unicode is a Headache. The major headache is that unicode
  and its very few encodings are not universally used. The headache is all
  the non-unicode legacy encodings still being used. So you better title
  this section 'Non-Unicode is a Headache'.
 [snip]
 I think he's right when he says Unicode is a headache, but only
 because it's being used to handle languages which are, themselves, a
 headache: left-to-right versus right-to-left, sometimes on the same
 line; diacritics, possibly several on a glyph; etc.

Yes, the headaches go a little further back than Unicode.
There is a certain large old book...
In which is described the building of a 'tower that reached up to heaven'...

At which point 'it was decided'¶ to do something to prevent that.

And our headaches started.

I dont know how one causally connects the 'headaches' but Ive seen
- mojibake
- unicode 'number-boxes' (what are these called?)
- Worst of all what we *dont* see -- how many others dont see what we see?

I never knew of any of this in the good ol days of ASCII
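Mojibake of the sort mentioned above is easy to reproduce; a minimal sketch (the sample word is mine):

```python
# encode as UTF-8, decode with the wrong codec: classic mojibake
s = 'café'
garbled = s.encode('utf-8').decode('latin-1')
assert garbled == 'cafÃ©'   # é (0xC3 0xA9 in UTF-8) read as two Latin-1 chars
```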

¶ Passive voice is often the best choice in the interests of political 
correctness

It would be a pleasant surprise if everyone sees a pilcrow at start of line 
above
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote:
 On 5/1/2014 2:04 PM, Rustom Mody wrote:

  Since its Unicode-troll time, here's my contribution
  http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

 I will not comment on the Unix-assumption part, but I think you go wrong 
 with this:  Unicode is a Headache. The major headache is that unicode 
 and its very few encodings are not universally used. The headache is all 
 the non-unicode legacy encodings still being used. So you better title 
 this section 'Non-Unicode is a Headache'.

 The first sentence is this misleading tautology: With ASCII, data is 
 ASCII whether its file, core, terminal, or network; ie ABC is 
 65,66,67. Let me translate: If all text is ASCII encoded, then text 
 data is ASCII, whether ... But it was never the case that all text was 
 ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe 
 still uses the latter. Other mainframe makers used other encodings of 
 A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never 
 universal. You could have just as well said With EBCDIC, data is 
 EBCDIC, whether ...

 https://en.wikipedia.org/wiki/Ascii
 https://en.wikipedia.org/wiki/EBCDIC

 A crucial step in the spread of Ascii was its use for microcomputers, 
 including the IBM PC. The latter was considered a toy by the mainframe 
 guys. If they had known that PCs would partly take over the computing 
 world, they might have suggested or insisted that it use EBCDIC.

 With unicode there are:
  encodings
 where 'encodings' is linked to
 https://en.wikipedia.org/wiki/Character_encodings_in_HTML

 If html 'always' used utf-8 (like xml), as has become common but not 
 universal, all of the problems with *non-unicode* character sets and 
 encodings would disappear. The pre-unicode declarations could then 
 disappear. More truthful: without unicode there are 100s of encodings 
 and with unicode only 3 that we should worry about.

 in-memory formats

 These are not the concern of the using programmer as long as they do not 
 introduce bugs or limitations (as do all the languages stuck on UCS-2 
 and many using UTF-16, including old Python narrow builds). Using what 
 should generally be the universal transmission format, UTF-8, as the 
 internal format means either losing indexing and slicing, having those 
 operations slow from O(1) to O(len(string)), or adding an index table 
 that is not part of the unicode standard. Using UTF-32 avoids the above 
 but usually wastes space -- up to 75%.

 strange beasties like python's FSR

 Have you really let yourself be poisoned by JMF's bizarre rants? The FSR 
 is an *internal optimization* that benefits most unicode operations that 
 people actually perform. It uses UTF-32 by default but adapts to the 
 strings users create by compressing the internal format. The compression 
 is trivial -- simply dropping leading null bytes common to all 
 characters -- so each character is still readable as is. The string 
 header records how many bytes are left.  Is the idea of algorithms that 
 adapt to inputs really strange to you?

 Like good adaptive algorithms, the FSR is invisible to the user except 
 for reducing space or time or maybe both. Unicode operations are 
 otherwise the same as with previous wide builds. People who used to use 
 narrow-builds also benefit from bug elimination. The only 'headaches' 
 involved might have been those of the developers who optimized previous 
 wide builds.

 CPython has many other functions with special-case optimizations and 
 'fast paths' for common, simple cases. For instance, (some? all?) number 
 operations are optimized for pairs of integers.  Do you call these 
 'strange beasties'?

Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge its nothing to do with unicode or with jmf.

Why if optimizations are always desirable do C compilers have:
-O0 O1 O2 O3 and zillions of more specific flags?

JFTR I have no issue with FSR.  What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am 
one of them]

I dont even know whether jmf has a real
technical (as he calls it 'mathematical') issue or it's entirely political:

Why should I pay more for a EURO sign than a $ sign?

Well perhaps that is more related to the exchange rate than to python!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Ben Finney
Rustom Mody rustompm...@gmail.com writes:

 Yes, the headaches go a little further back than Unicode.

Okay, so can you change your article to reflect the fact that the
headaches both pre-date Unicode, and are made much easier by Unicode?

 There is a certain large old book...

Ah yes, the neo-Sumerian story “Enmerkar_and_the_Lord_of_Aratta”
URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta.
Probably inspired by stories older than that, of course.

 In which is described the building of a 'tower that reached up to heaven'...
 At which point 'it was decided'¶ to do something to prevent that.
 And our headaches started.

And other myths with fantastic reasons for the diversity of language
URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language.

 I never knew of any of this in the good ol days of ASCII

Yes, by ignoring all other writing systems except one's own – and
thereby excluding most of the world's people – the system can be made
simpler.

Hopefully the proportion of programmers who still feel they can make
such a parochial choice is rapidly shrinking.

-- 
 \ “Why doesn't Python warn that it's not 100% perfect? Are people |
  `\ just supposed to “know” this, magically?” —Mitya Sirenef, |
_o__) comp.lang.python, 2012-12-27 |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Friday, May 2, 2014 7:59:55 AM UTC+5:30, Rustom Mody wrote:
 Why should I pay more for a EURO sign than a $ sign?

A unicode 'headache' there:
I typed the Euro sign (trying again € ) not EURO

Somebody -- I guess its GG in overhelpful mode -- converted it
And made my post: 
Content-Type: text/plain; charset=ISO-8859-1

Will some devanagari vowels help it stop being helpful?
अ आ इ ई उ ऊ ए ऐ
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Friday, May 2, 2014 8:09:44 AM UTC+5:30, Ben Finney wrote:
 Rustom Mody  writes:

  Yes, the headaches go a little further back than Unicode.

 Okay, so can you change your article to reflect the fact that the
 headaches both pre-date Unicode, and are made much easier by Unicode?

Predate: Yes
Made easier: No

  There is a certain large old book...

 Ah yes, the neo-Sumerian story Enmerkar_and_the_Lord_of_Aratta
 URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta.
 Probably inspired by stories older than that, of course.

Thanks for that link

  In which is described the building of a 'tower that reached up to heaven'...
  At which point 'it was decided'¶ to do something to prevent that.
  And our headaches started.

 And other myths with fantastic reasons for the diversity of language
 URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language.

This one takes the cake - see 1st para
http://hilgart.org/enformy/BronsonRekindling.pdf


  I never knew of any of this in the good ol days of ASCII

 Yes, by ignoring all other writing systems except one's own - and
 thereby excluding most of the world's people - the system can be made
 simpler.

 Hopefully the proportion of programmers who still feel they can make
 such a parochial choice is rapidly shrinking.

See link above: Ethnic differences and chauvinism are invariably linked
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Chris Angelico
On Fri, May 2, 2014 at 12:29 PM, Rustom Mody rustompm...@gmail.com wrote:
 Here is an instance of someone who would like a certain optimization to be
 dis-able-able

 https://mail.python.org/pipermail/python-list/2014-February/667169.html

 To the best of my knowledge its nothing to do with unicode or with jmf.

It doesn't, and it has only to do with testing. I've had similar
issues at times; for instance, trying to benchmark one language or
language construct against another often means fighting against an
optimizer. (How, for instance, do you figure out what loop overhead
is, when an empty loop is completely optimized out?) This is nothing
whatsoever to do with Unicode, nor to do with the optimization that
Python and Pike (and maybe other languages) do with the storage of
Unicode strings.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Steven D'Aprano
On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote:

 strange beasties like python's FSR
 
 Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
 is an *internal optimization* that benefits most unicode operations that
 people actually perform. It uses UTF-32 by default but adapts to the
 strings users create by compressing the internal format. The compression
 is trivial -- simple dropping leading null bytes common to all
 characters -- so each character is still readable as is.

For anyone who, like me, wasn't convinced that Unicode worked that way, 
you can see for yourself that it does. You don't need Python 3.3, any 
version of 3.x will work. In Python 2.7, it should work if you just 
change the calls from chr() to unichr():

py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'
...     assert u[2:] == c.encode('utf-16-be')
...
py>


So Terry is correct: dropping leading zeroes, and treating the remainder 
as either Latin-1 or UTF-16, works fine, and potentially saves a lot of 
memory.


-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Friday, May 2, 2014 8:31:56 AM UTC+5:30, Chris Angelico wrote:
 On Fri, May 2, 2014 at 12:29 PM, Rustom Mody wrote:
  Here is an instance of someone who would like a certain optimization to be
  dis-able-able
  https://mail.python.org/pipermail/python-list/2014-February/667169.html
  To the best of my knowledge its nothing to do with unicode or with jmf.

 It doesn't, and it has only to do with testing. I've had similar
 issues at times; for instance, trying to benchmark one language or
 language construct against another often means fighting against an
 optimizer. (How, for instance, do you figure out what loop overhead
 is, when an empty loop is completely optimized out?) This is nothing
 whatsoever to do with Unicode, nor to do with the optimization that
 Python and Pike (and maybe other languages) do with the storage of
 Unicode strings.

This was said in response to Terry's

 CPython has many other functions with special-case optimizations and
 'fast paths' for common, simple cases. For instance, (some? all?) number
 operations are optimized for pairs of integers.  Do you call these
 'strange beasties'?

which evidently vanished -- optimized out :D -- in multiple levels of quoting 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Terry Reedy

On 5/1/2014 7:33 PM, MRAB wrote:

On 2014-05-01 23:38, Terry Reedy wrote:

On 5/1/2014 2:04 PM, Rustom Mody wrote:


Since its Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


I will not comment on the Unix-assumption part, but I think you go wrong
with this:  Unicode is a Headache. The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you better title
this section 'Non-Unicode is a Headache'.


[snip]
I think he's right when he says Unicode is a headache, but only
because it's being used to handle languages which are, themselves, a
headache: left-to-right versus right-to-left, sometimes on the same
line;


Handling that without unicode is even worse.


diacritics, possibly several on a glyph; etc.


Ditto.

--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Rustom Mody
On Friday, May 2, 2014 9:46:36 AM UTC+5:30, Terry Reedy wrote:
 On 5/1/2014 7:33 PM, MRAB wrote:
  On 2014-05-01 23:38, Terry Reedy wrote:
  On 5/1/2014 2:04 PM, Rustom Mody wrote:
  Since its Unicode-troll time, here's my contribution
  http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
  I will not comment on the Unix-assumption part, but I think you go wrong
  with this:  Unicode is a Headache. The major headache is that unicode
  and its very few encodings are not universally used. The headache is all
  the non-unicode legacy encodings still being used. So you better title
  this section 'Non-Unicode is a Headache'.
  [snip]
  I think he's right when he says Unicode is a headache, but only
  because it's being used to handle languages which are, themselves, a
  headache: left-to-right versus right-to-left, sometimes on the same
  line;

 Handling that without unicode is even worse.

  diacritics, possibly several on a glyph; etc.

 Ditto.

Whats the best cure for headache?

Cut off the head

Whats the best cure for Unicode?

Ascii

Saying however that there is no headache in unicode does not make the headache
go away:

http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

No I am not saying that the contents/style/tone are right.
However people are evidently suffering the transition.
Denying it is not a help.

And unicode consortium's ways are not exactly helpful to its own cause:
Imagine the C standard committee deciding that adding mandatory garbage 
collection
to C is a neat idea

Unicode consortium's going from old BMP to current (6.0) SMPs to who-knows-what
in the future is similar.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Chris Angelico
On Fri, May 2, 2014 at 2:42 PM, Rustom Mody rustompm...@gmail.com wrote:
 Unicode consortium's going from old BMP to current (6.0) SMPs to 
 who-knows-what
 in the future is similar.

Unicode 1.0: Let's make a single universal character set that can
represent all the world's scripts. We'll define 65536 codepoints to do
that with.

Unicode 2.0: Oh. That's not enough. Okay, let's define some more.

It's not a fundamental change, nor is it unhelpful to Unicode's cause.
It's simply an acknowledgement that 64K codepoints aren't enough. Yes,
that gave us the mess of UTF-16 being called "Unicode" (if it hadn't
been for Unicode 1.0, I doubt we'd now have so many languages using
and exposing UTF-16 - it'd be a simple judgment call, pick
UTF-8/UTF-16/UTF-32 based on what you expect your users to want to
use), but it doesn't change Unicode's goal, and it also doesn't
indicate that there's likely to be any more such changes in the
future. (Just look at how little of the Unicode space is allocated so
far.)
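The BMP/SMP split Chris describes can be seen from code; a small sketch (the emoji code point is my example):

```python
# a character from the SMP lies beyond the original 64K BMP limit,
# so UTF-16 must spend a surrogate pair (two 16-bit units) on it
s = '\U0001F600'
assert ord(s) > 0xFFFF
assert len(s.encode('utf-16-le')) == 4   # surrogate pair
assert len(s.encode('utf-32-le')) == 4   # one 32-bit unit
```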

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-05-01 Thread Terry Reedy

On 5/1/2014 10:29 PM, Rustom Mody wrote:


Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge its nothing to do with unicode or with jmf.


Right. Ned has an actual technical reason to complain, even though the 
developers do not consider it strong enough to act.



Why if optimizations are always desirable do C compilers have:
-O0 O1 O2 O3 and zillions of more specific flags?


One reason is that many optimizations sometimes introduce bugs, or to 
put it another way, they are based on assumptions that are not true for 
all code. For instance, some people have suggested that CPython should 
have an optional optimization based on the assumption that builtin names 
are never rebound. That is true for perhaps many code files, but 
definitely not all. Guido does not seem to like such conditional 
optimizations.


I can think of three reasons for not adding to the numerous options 
CPython already has.
1. We do not have the developers resources to handle the added 
complications of multiple optimization options.
2. Zillions of options and flags confuse users. As it is, most options 
are seldom used.
3. Optimization options are easily misused, possibly leading to silently 
buggy results, or mysterious failures. For instance, people sometimes 
rebind builtins without realizing what they have done, such as using 
'id' as a parameter name. Being in the habit of routinely using the 
'assume no rebinding option' would lead to problems.


I am rather sure that the string (unicode) test suite was reviewed and 
the performance of 3.2 wide builds recorded before the new 
implementation was committed.


The tracker currently has 37 behavior (bug) issues marked for the 
unicode component. In a quick review, I do not see that any have 
anything to do with using standard UTF-32 versus adaptive UTF-32. 
Indeed, I believe a majority of the 37 were filed before 3.3 or are 2.7 
specific. Problems with FSR itself have been fixed as discovered.



JFTR I have no issue with FSR.  What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am 
one of them]


Somewhat ironically, I suppose your are right.


I dont even know whether jmf has a real
technical (as he calls it 'mathematical') issue or its entirely political:


I would call his view personal or philosophical. I only object to 
endless repetition and the deception of claiming that personal views are 
mathematical facts.


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-04-30 Thread wxjmfauth
@ Time Chase

I'm perfectly aware about what I'm doing.


@ MRAB

"...Although the third example is the fastest, it's also the wrong
way to handle Unicode: ..."

Maybe that's exactly the opposite. It illustrates very well,
the quality of coding schemes endorsed by Unicode.org.
I deliberately choose utf-8.


>>> sys.getsizeof('\u0fce')
40
>>> sys.getsizeof('\u0fce'.encode('utf-8'))
20
>>> sys.getsizeof('\u0fce'.encode('utf-16-be'))
19
>>> sys.getsizeof('\u0fce'.encode('utf-32-be'))
21

Q. How to save memory without wasting time in encoding?
By using products using natively the unicode coding schemes?

Are you understanding unicode? Or are you understanding
unicode via Python?

---

A Tibetan monk [*] using Py32:

>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[2.3394840182882186, 2.3145832750782653, 2.3207231951529685]
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[2.328517624800078, 2.3169403900011076, 2.317586282812048]


[*] Your curiosity has certainly shown, what this code point means.
For the others:
U+0FCE TIBETAN SIGN RDEL NAG RDEL DKAR
signifies good luck earlier, bad luck later


(My comment: Good luck with Python or bad luck with Python)

jmf
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-04-30 Thread Tim Chase
On 2014-04-30 00:06, wxjmfa...@gmail.com wrote:
 @ Time Chase
 
 I'm perfectly aware about what I'm doing.

Apparently, you're quite adept at appending superfluous characters to
sensible strings...did you benchmark your email composition, too? ;-)

-tkc (aka Tim, not Time)




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-04-30 Thread Steven D'Aprano
On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote:

 On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
 While I dislike feeding the troll, what I see here is:
 
 snipped
 
 Since its Unicode-troll time, here's my contribution
 http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


I disagree with much of your characterisation of the Unix assumption, and 
I point out that out of the two most widespread flavours of OS today, 
Linux/Unix and Windows, it is *Windows* and not Unix which still 
regularly uses legacy encodings.

Also your link to Joel On Software mistakenly links to me instead of Joel.

There's a missing apostrophe in Ive [sic] in Acknowledgment #2.

I didn't notice any other typos.


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-04-29 Thread Tim Chase
On 2014-04-29 10:37, wxjmfa...@gmail.com wrote:
  timeit.repeat((x*1000 + y)[:-1], setup=x = 'abc'; y = 'z')  
 [1.4027834829454946, 1.38714224331963, 1.3822586635296261]
  timeit.repeat((x*1000 + y)[:-1], setup=x = 'abc'; y =
  '\u0fce')  
 [5.462776291480395, 5.4479432055423445, 5.447874284053398]
  
  
  # more interesting
  timeit.repeat((x*1000 + y)[:-1],\  
 ... setup=x = 'abc'.encode('utf-8'); y =
 '\u0fce'.encode('utf-8')) [1.3496489533188765, 1.328654286266783,
 1.3300913977710707]


While I dislike feeding the troll, what I see here is:  on your
machine, all unicode manipulations in the test should take ~5.4
seconds.  But Python notices that some of your strings *don't*
require a full 32-bits and thus optimizes those operations, cutting
about 75% of the processing time (wow...4-bytes-per-char to
1-byte-per-char, I wonder where that 75% savings comes from).

So rather than highlight any *problem* with Python, your [mostly
worthless microbenchmark non-realworld] tests show that Python's
unicode implementation is awesome.

Still waiting to see an actual bug-report as mentioned on the other
thread.

-tkc





-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-04-29 Thread MRAB

On 2014-04-29 18:37, wxjmfa...@gmail.com wrote:

Let see how Python is ready for the next Unicode version
(Unicode 7.0.0.Beta).



timeit.repeat((x*1000 + y)[:-1], setup=x = 'abc'; y = 'z')

[1.4027834829454946, 1.38714224331963, 1.3822586635296261]

timeit.repeat((x*1000 + y)[:-1], setup=x = 'abc'; y = '\u0fce')

[5.462776291480395, 5.4479432055423445, 5.447874284053398]



# more interesting
timeit.repeat((x*1000 + y)[:-1],\

... setup=x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8'))
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]





Although the third example is the fastest, it's also the wrong way to
handle Unicode:

 x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')
 t = (x*1000 + y)[:-1].decode('utf-8')
Traceback (most recent call last):
  File stdin, line 1, in module
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 
3000-3001: unex

pected end of data


Note 1:  lookup is not the problem.

Note 2: From Unicode.org : [...] We strongly encourage [...] and test
them with their programs [...]

- Done.

jmf



--
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode 7

2014-04-29 Thread Rustom Mody
On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote:
 While I dislike feeding the troll, what I see here is: 

snipped

Since its Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

:-)

More seriously, since Ive quoted some esteemed members of this list 
explicitly (Steven) and the list in general, please let me know if
something is inaccurate or inappropriate

 
-- 
https://mail.python.org/mailman/listinfo/python-list