Re: Unicode 7
On Thu, 01 May 2014 21:42:21 -0700, Rustom Mody wrote: What's the best cure for headache? Cut off the head o_O I don't think so. What's the best cure for Unicode? ASCII Unicode is not a problem to be solved. The inability to write standard human text in ASCII is a problem, e.g. one cannot write “ASCII For Dummies” © 2014 by Zöe Smith, now on sale 99¢ so even *Americans* cannot represent all their common characters in ASCII, let alone specialised characters from mathematics, science, the printing industry, and law. And even Americans sometimes need to write text in Foreign. Where is your ASCII now? The solution is to have at least one encoding which contains the additional characters needed. The plethora of such additional encodings is a problem. The solution is a single encoding that covers all needed characters, like Unicode, so that there is no need to handle multiple encodings. The inability of plain text files to record metadata about what encoding they use is a problem. The solution is to standardize on a single, world-wide encoding, like Unicode. Saying however that there is no headache in Unicode does not make the headache go away: http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ No, I am not saying that the contents/style/tone are right. However, people are evidently suffering the transition. Denying it is not a help. Transitions are always more painful than after the transition has settled down. As I have said repeatedly, I look forward to the day when nobody but document archivists and academics need care about legacy encodings. But we're not there yet. And the Unicode Consortium's ways are not exactly helpful to its own cause: Imagine the C standards committee deciding that adding mandatory garbage collection to C is a neat idea. The Unicode Consortium's going from the old BMP to the current (6.0) SMPs to who-knows-what in the future is similar. I don't see the connection. 
-- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote: I don't know how one causally connects the 'headaches' but I've seen - mojibake Mojibake is certainly more common with multiple encodings, but the solution to that is Unicode, not ASCII. In fact, in your blog post you even link to a post of mine where I explain that ASCII has gone through multiple backwards-incompatible changes over the decades, which means you can have a limited form of mojibake even in pure ASCII. Between changes over various versions of ASCII, and ambiguous characters allowed by the standard, you needed some sort of out-of-band metadata to tell you whether they intended an @ or a `, a | or a ¬, a £ or a #, to mention only a few. It's only since the 1980s that ASCII, actual 7-bit US ASCII, has become an unambiguous standard. But that's okay, because that merely allowed people to create dozens of 7-bit and 8-bit variations on ASCII, all incompatible with each other, and *call them ASCII* regardless of the actual standard name. Between ambiguities in actual ASCII, and the common practice of labelling non-ASCII as ASCII, I can categorically say that mojibake has always been possible in so-called plain text. If you haven't noticed it, it was because you were only exchanging documents with people who happened to use the same set of characters as you. - unicode 'number-boxes' (what are these called?) They are missing character glyphs, and they have nothing to do with Unicode. They are due to deficiencies in the text font you are using. Admittedly, with Unicode's 0x10FFFF possible characters (actually more, since a single code point can have multiple glyphs) it isn't surprising that most font designers have neither the time, skill, nor desire to create a glyph for every single code point. But then the same applies even for more restrictive 8-bit encodings -- sometimes font designers don't even bother providing glyphs for *ASCII* characters. (E.g. they may only provide glyphs for uppercase A...Z, not lowercase.) 
- Worst of all what we *don't* see -- how many others don't see what we see? Again, this is a deficiency of the font. There are very few code points in Unicode which are intended to be invisible, e.g. space, newline, zero-width joiner, control characters, etc., but they ought to be equally invisible to everyone. No printable character should ever be invisible in any decent font. I never knew of any of this in the good ol' days of ASCII You must have been happy with a very impoverished set of symbols, then. ¶ Passive voice is often the best choice in the interests of political correctness. It would be a pleasant surprise if everyone sees a pilcrow at the start of the line above. I do. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
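The mojibake discussed above is easy to manufacture in Python 3; a minimal sketch (the text and the pair of encodings are purely illustrative):

```python
# Encode text as UTF-8, then decode the bytes with the wrong codec
# (Latin-1 here): the classic recipe for mojibake.
text = "Zöe Smith, 99¢"
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # ZÃ¶e Smith, 99Â¢
```

The round trip is lossless, which is why mojibake can often be undone: `garbled.encode("latin-1").decode("utf-8")` recovers the original string.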
Re: Unicode 7
On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: ... even *Americans* cannot represent all their common characters in ASCII, let alone specialised characters from mathematics, science, the printing industry, and law. Aside: What additional characters does law use that aren't in ASCII? Section § and paragraph ¶ are used frequently, but you already mentioned the printing industry. Are there other symbols? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Fri, May 2, 2014 at 6:45 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: - unicode 'number-boxes' (what are these called?) They are missing character glyphs, and they have nothing to do with Unicode. They are due to deficiencies in the text font you are using. Admittedly with Unicode's 0x10FFFF possible characters (actually more, since a single code point can have multiple glyphs) it isn't surprising that most font designers have neither the time, skill, nor desire to create a glyph for every single code point. But then the same applies even for more restrictive 8-bit encodings -- sometimes font designers don't even bother providing glyphs for *ASCII* characters. (E.g. they may only provide glyphs for uppercase A...Z, not lowercase.) This is another area where Unicode has given us a great improvement over the old method of giving satisfaction. Back in the 1990s on OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a simple square with no information, or (c) copied from some other font (common with dingbats fonts). With Unicode, the standard is to show a little box *with the hex digits in it*. Granted, those boxes are a LOT more readable for BMP characters than SMP (unless your text is huge, six digits in the space of one character will make them pretty tiny), and a Unicode font will generally include all (or at least most) of the BMP, but it's still better than having no information at all. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
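When all a renderer can show is a hex-digit box, the same information is available programmatically; a small sketch (the sample characters are arbitrary, the last one deliberately from the SMP):

```python
import unicodedata

# Print the code point and official name of each character, i.e. the
# information a "hex box" glyph is trying to convey.
for c in "⟮⦇𝄞":
    print(f"U+{ord(c):04X}", unicodedata.name(c, "<no name>"))
```

The SMP character prints as U+1D11E (MUSICAL SYMBOL G CLEF), six digits where the BMP characters need only four.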
Re: Unicode 7
Chris Angelico ros...@gmail.com writes: On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: ... even *Americans* cannot represent all their common characters in ASCII, let alone specialised characters from mathematics, science, the printing industry, and law. Aside: What additional characters does law use that aren't in ASCII? Section § and paragraph ¶ are used frequently, but you already mentioned the printing industry. Are there other symbols? ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE REGISTERED SIGN), for instance. -- \ “I got some new underwear the other day. Well, new to me.” —Emo | `\ Philips | _o__) | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
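Easy to verify from Python; the set of signs checked here is just for illustration:

```python
# Try to encode each legal/printing sign as ASCII; every one fails,
# confirming that none of them exists in the 7-bit character set.
for ch in "©®§¶™":
    try:
        ch.encode("ascii")
        print(ch, "fits in ASCII")
    except UnicodeEncodeError:
        print(ch, "is not ASCII")
```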
Re: Unicode 7
On Fri, May 2, 2014 at 7:16 PM, Ben Finney b...@benfinney.id.au wrote: Chris Angelico ros...@gmail.com writes: On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: ... even *Americans* cannot represent all their common characters in ASCII, let alone specialised characters from mathematics, science, the printing industry, and law. Aside: What additional characters does law use that aren't in ASCII? Section § and paragraph ¶ are used frequently, but you already mentioned the printing industry. Are there other symbols? ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE REGISTERED SIGN), for instance. Heh! I forgot about those. U+00A9 in particular has gone so mainstream that it's easy to think of it not as "I'm going to switch to my 'British English + Legal' dictionary now" and just as "This is a critical part of the basic dictionary." ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
Chris Angelico writes: (common with dingbats fonts). With Unicode, the standard is to show a little box *with the hex digits in it*. Granted, those boxes are a LOT more readable for BMP characters than SMP (unless your text is huge, six digits in the space of one character will make them pretty tiny), and a Unicode font will generally include all (or at least most) of the BMP, but it's still better than having no information I needed to see such tiny numbers just today, just the four of them in the tiny box. So I pressed C-+ a few times to _make_ the text huge, obtained my information, and returned to my normal text size with C--. Perfect. Usually all I need to know is that I have a character for which I don't have a glyph, but this time I wanted to record the number because I was testing things rather than reading the text. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
Ben Finney b...@benfinney.id.au: Aside: What additional characters does law use that aren't in ASCII? Section § and paragraph ¶ are used frequently, but you already mentioned the printing industry. Are there other symbols? ASCII does not contain “©” (U+00A9 COPYRIGHT SIGN) nor “®” (U+00AE REGISTERED SIGN), for instance. The em-dash is mapped on my keyboard — I use it quite often. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote: On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote: - Worst of all what we *don't* see -- how many others don't see what we see? Again, this is a deficiency of the font. There are very few code points in Unicode which are intended to be invisible, e.g. space, newline, zero-width joiner, control characters, etc., but they ought to be equally invisible to everyone. No printable character should ever be invisible in any decent font. That's not what I meant. I wrote http://blog.languager.org/2014/04/unicoded-python.html – mostly on a Debian box. Later, on seeing it on a less heavily set-up Ubuntu box, I see ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊ have become 'missing-glyph' boxes. It leads me to ask: how much else of what I am writing has some random reader simply not seen? Quite simply, we can never know – because most are going to go away saying "mojibake/garbled rubbish". Speaking of what you understood of what I said: yes, invisible chars are another problem I was recently bitten by. I pasted something from Google into Emacs' org mode. Following that link again, I kept getting a broken link, until I found that the link had an invisible char. The problem was that Emacs was faithfully rendering that char according to the standard, i.e. invisibly! -- https://mail.python.org/mailman/listinfo/python-list
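One way to hunt down an invisible character like the one that broke the link is to scan the string for control and format code points. A sketch in Python (the URL and the stray zero-width space are made up for the example):

```python
import unicodedata

# Hypothetical pasted link with a stray zero-width space at the end.
url = "http://example.com/page\u200b"

# Report any control (Cc) or format (Cf) characters lurking in the string.
for i, ch in enumerate(url):
    if unicodedata.category(ch) in ("Cc", "Cf"):
        print(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
# prints: 23 U+200B ZERO WIDTH SPACE
```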
Re: Unicode 7
On Fri, 02 May 2014 19:01:44 +1000, Chris Angelico wrote: On Fri, May 2, 2014 at 6:08 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: ... even *Americans* cannot represent all their common characters in ASCII, let alone specialised characters from mathematics, science, the printing industry, and law. Aside: What additional characters does law use that aren't in ASCII? Section § and paragraph ¶ are used frequently, but you already mentioned the printing industry. Are there other symbols? I was thinking of copyright, trademark, registered mark, and similar. I think these are all of the relevant characters:

py> for c in '©®℗™':
...     unicodedata.name(c)
...
'COPYRIGHT SIGN'
'REGISTERED SIGN'
'SOUND RECORDING COPYRIGHT'
'TRADE MARK SIGN'

-- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Fri, 02 May 2014 03:39:34 -0700, Rustom Mody wrote: On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote: On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote: - Worst of all what we *don't* see -- how many others don't see what we see? Again, this is a deficiency of the font. There are very few code points in Unicode which are intended to be invisible, e.g. space, newline, zero-width joiner, control characters, etc., but they ought to be equally invisible to everyone. No printable character should ever be invisible in any decent font. That's not what I meant. I wrote http://blog.languager.org/2014/04/unicoded-python.html – mostly on a Debian box. Later, on seeing it on a less heavily set-up Ubuntu box, I see ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊ have become 'missing-glyph' boxes. It leads me to ask: how much else of what I am writing has some random reader simply not seen? Quite simply, we can never know – because most are going to go away saying "mojibake/garbled rubbish". Speaking of what you understood of what I said: yes, invisible chars are another problem I was recently bitten by. I pasted something from Google into Emacs' org mode. Following that link again, I kept getting a broken link, until I found that the link had an invisible char. The problem was that Emacs was faithfully rendering that char according to the standard, i.e. invisibly! And you've never been bitten by an invisible control character in ASCII text? You've lived a sheltered life! Nothing you are describing is unique to Unicode. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: And you've never been bitten by an invisible control character in ASCII text? You've lived a sheltered life! That reminds me: the non-breaking space (U+00A0) is often used between numbers and units, for example. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-05-02 19:08, Chris Angelico wrote: This is another area where Unicode has given us a great improvement over the old method of giving satisfaction. Back in the 1990s on OS/2, DOS, and Windows, a missing glyph might be (a) blank, (b) a simple square with no information, or (c) copied from some other font (common with dingbats fonts). With Unicode, the standard is to show a little box *with the hex digits in it*. Granted, those boxes are a LOT more readable for BMP characters than SMP (unless your text is huge, six digits in the space of one character will make them pretty tiny), and a Unicode font will generally include all (or at least most) of the BMP, but it's still better than having no information at all. I'm pleased when applications and fonts work properly, using both the placeholder "this character is legitimate but I can't display it with a font, so here, have a box with the codepoint numbers in it until I'm directed to use a more appropriate font, at which point you'll see it correctly" and the "somebody crammed garbage in here, so I'll display it with � (U+FFFD)", which is designated for exactly this purpose. -tkc -- https://mail.python.org/mailman/listinfo/python-list
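The second behaviour Tim describes corresponds to Python's "replace" error handler, which substitutes U+FFFD for undecodable bytes; a minimal sketch (the byte string is contrived):

```python
# Latin-1 bytes mis-read as UTF-8: the undecodable byte 0xE9 becomes
# U+FFFD REPLACEMENT CHARACTER instead of raising UnicodeDecodeError.
data = b"caf\xe9"  # "café" encoded as Latin-1
print(data.decode("utf-8", errors="replace"))  # caf�
```

With the default strict handler the same call raises `UnicodeDecodeError`, which is the right behaviour when garbage should not be silently papered over.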
Re: Unicode 7
On Friday, May 2, 2014 5:25:37 PM UTC+5:30, Steven D'Aprano wrote: On Fri, 02 May 2014 03:39:34 -0700, Rustom Mody wrote: On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote: On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote: - Worst of all what we *don't* see -- how many others don't see what we see? Again, this is a deficiency of the font. There are very few code points in Unicode which are intended to be invisible, e.g. space, newline, zero-width joiner, control characters, etc., but they ought to be equally invisible to everyone. No printable character should ever be invisible in any decent font. That's not what I meant. I wrote http://blog.languager.org/2014/04/unicoded-python.html – mostly on a Debian box. Later, on seeing it on a less heavily set-up Ubuntu box, I see ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊ have become 'missing-glyph' boxes. It leads me to ask: how much else of what I am writing has some random reader simply not seen? Quite simply, we can never know – because most are going to go away saying "mojibake/garbled rubbish". Speaking of what you understood of what I said: yes, invisible chars are another problem I was recently bitten by. I pasted something from Google into Emacs' org mode. Following that link again, I kept getting a broken link, until I found that the link had an invisible char. The problem was that Emacs was faithfully rendering that char according to the standard, i.e. invisibly! And you've never been bitten by an invisible control character in ASCII text? You've lived a sheltered life!

For control characters I've seen:
- garbage (the ASCII equivalent of mojibake)
- straight ^A^B^C
- maybe their names: NUL, SOH, STX, ETX, EOT, ENQ, ACK…
- or maybe just a little dot .
- more pathological behavior: a control sequence putting the terminal into some other mode

But I don't ever remember seeing a control character become invisible (except [ \t\n\f]) -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-05-02 03:39, Ben Finney wrote: Rustom Mody rustompm...@gmail.com writes: Yes, the headaches go a little further back than Unicode. Okay, so can you change your article to reflect the fact that the headaches both pre-date Unicode, and are made much easier by Unicode? There is a certain large old book... Ah yes, the neo-Sumerian story “Enmerkar_and_the_Lord_of_Aratta” URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta. Probably inspired by stories older than that, of course. In which is described the building of a 'tower that reached up to heaven'... At which point 'it was decided'¶ to do something to prevent that. And our headaches started. And other myths with fantastic reasons for the diversity of language URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language. I never knew of any of this in the good ol days of ASCII Yes, by ignoring all other writing systems except one's own – and thereby excluding most of the world's people – the system can be made simpler. ASCII lacked even £. I can remember assembly listings in magazines containing lines such as: LDA £0 I even (vaguely) remember an advert with a character that looked like Ł, presumably because they didn't have £. In a UK magazine? Very strange! Hopefully the proportion of programmers who still feel they can make such a parochial choice is rapidly shrinking. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 5:25:37 PM UTC+5:30, Steven D'Aprano wrote: On Fri, 02 May 2014 03:39:34 -0700, Rustom Mody wrote: On Friday, May 2, 2014 2:15:41 PM UTC+5:30, Steven D'Aprano wrote: On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote: - Worst of all what we *don't* see -- how many others don't see what we see? Again, this is a deficiency of the font. There are very few code points in Unicode which are intended to be invisible, e.g. space, newline, zero-width joiner, control characters, etc., but they ought to be equally invisible to everyone. No printable character should ever be invisible in any decent font. That's not what I meant. I wrote http://blog.languager.org/2014/04/unicoded-python.html – mostly on a Debian box. Later, on seeing it on a less heavily set-up Ubuntu box, I see ⟮ ⟯ ⟬ ⟭ ⦇ ⦈ ⦉ ⦊ have become 'missing-glyph' boxes. It leads me to ask: how much else of what I am writing has some random reader simply not seen? Quite simply, we can never know – because most are going to go away saying "mojibake/garbled rubbish". Speaking of what you understood of what I said: yes, invisible chars are another problem I was recently bitten by. I pasted something from Google into Emacs' org mode. Following that link again, I kept getting a broken link, until I found that the link had an invisible char. The problem was that Emacs was faithfully rendering that char according to the standard, i.e. invisibly! And you've never been bitten by an invisible control character in ASCII text? You've lived a sheltered life! Nothing you are describing is unique to Unicode. Just noticed a small thing in which Python does a bit better than Haskell:

$ ghci
Prelude> let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
Prelude>

In case it's not apparent, the fi in the first fine is a ligature. Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
        ^
SyntaxError: invalid syntax

The point of that example is to show that Unicode gives all kinds of Aaah! Gotcha!! opportunities that just don't exist in the old world. 
Python may have got this one right but there are surely dozens of others. On the other hand I see more eagerness for Unicode source-text there, e.g. https://github.com/i-tu/Hasklig http://www.haskell.org/ghc/docs/latest/html/users_guide/syntax-extns.html#unicode-syntax http://www.haskell.org/haskellwiki/Unicode-symbols http://hackage.haskell.org/package/base-unicode-symbols Some music 𝄞 𝄢 ♭ 𝄱 to appease the utf-8 gods -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-05-02 09:08, Steven D'Aprano wrote: On Thu, 01 May 2014 21:42:21 -0700, Rustom Mody wrote: Whats the best cure for headache? Cut off the head o_O I don't think so. Whats the best cure for Unicode? Ascii Unicode is not a problem to be solved. The inability to write standard human text in ASCII is a problem, e.g. one cannot write “ASCII For Dummies” © 2014 by Zöe Smith, now on sale 99¢ [snip] Shouldn't that be Zoë? -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 05/02/2014 10:50 AM, Rustom Mody wrote: Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
        ^
SyntaxError: invalid syntax

The point of that example is to show that Unicode gives all kinds of Aaah! Gotcha!! opportunities that just don't exist in the old world. Python may have got this one right but there are surely dozens of others. Except that it doesn't. This has nothing to do with Unicode handling. It has everything to do with what defines an identifier in Python. This is no different than someone wondering why they can't start an identifier in Python 1.x with a number or punctuation mark. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 5/2/14 12:50 PM, Rustom Mody wrote: Just noticed a small thing in which Python does a bit better than Haskell:

$ ghci
Prelude> let (ﬁne, fine) = (1,2)
Prelude> (ﬁne, fine)
(1,2)
Prelude>

In case it's not apparent, the fi in the first fine is a ligature. Python just barfs:

>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
        ^
SyntaxError: invalid syntax

Surely by now we could at least be explicit about which version of Python we are talking about?

$ python2.7
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ﬁne = 1
  File "<stdin>", line 1
    ﬁne = 1
        ^
SyntaxError: invalid syntax
>>> ^D
$ python3.4
Python 3.4.0b1 (default, Dec 16 2013, 21:05:22)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ﬁne = 1
>>> ﬁne
1

In Python 2 identifiers must be ASCII. Python 3 allows many Unicode characters in identifiers (see PEP 3131 for details: http://legacy.python.org/dev/peps/pep-3131/) -- Ned Batchelder, http://nedbatchelder.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
Rustom Mody wrote: Just noticed a small thing in which Python does a bit better than Haskell: $ ghci let (ﬁne, fine) = (1,2) Prelude (ﬁne, fine) (1,2) Prelude In case it's not apparent, the fi in the first fine is a ligature. Python just barfs: Not Python 3:

Python 3.3.2+ (default, Feb 28 2014, 00:52:16)
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (ﬁne, fine) = (1,2)
>>> (ﬁne, fine)
(2, 2)

No copy-and-paste errors involved:

>>> eval("\ufb01ne")
2
>>> eval(b"fine".decode("ascii"))
2

-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
Marko Rauhamaa ma...@pacujo.net writes: That reminds me: [U+00A0 NON-BREAKING SPACE] is often used between numbers and units, for example. The non-breaking space (“ ” U+00A0) is frequently used in text to keep conceptually inseparable text such as “100 km” from automatic word breaks URL:https://en.wikipedia.org/wiki/Non-breaking_space. Because of established, conflicting conventions for separating groups of digits (“1,234.00” in many countries; “1.234,00” in many others) URL:https://en.wikipedia.org/wiki/Thousands_separator#Digit_grouping, the “ ” U+2009 THIN SPACE URL:https://en.wikipedia.org/wiki/Thin_Space is recommended for separating digit groups (e.g. “1 234 567 m”) URL:https://en.wikipedia.org/wiki/SI_units#General_rules. -- \ “We spend the first twelve months of our children's lives | `\ teaching them to walk and talk and the next twelve years | _o__) telling them to sit down and shut up.” —Phyllis Diller | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
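Python's format mini-language can produce such digit groupings; swapping in the thin space and no-break space is a sketch of the SI-style convention Ben describes (the unit string is illustrative):

```python
n = 1234567
print(f"{n:,}")  # 1,234,567  (comma grouping)

# SI-style sketch: U+2009 THIN SPACE between digit groups,
# U+00A0 NO-BREAK SPACE before the unit.
si = f"{n:,}".replace(",", "\u2009") + "\u00a0m"
print(si)
```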
Re: Unicode 7
In article mailman.9659.1399064866.18130.python-l...@python.org, Ben Finney b...@benfinney.id.au wrote: The non-breaking space (“ ” U+00A0) is frequently used in text to keep conceptually inseparable text such as “100 km” from automatic word breaks URL:https://en.wikipedia.org/wiki/Non-breaking_space. Which, by the way, argparse doesn't honor... http://bugs.python.org/issue16623 -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 11:37:02 PM UTC+5:30, Peter Otten wrote: Rustom Mody wrote: Just noticed a small thing in which Python does a bit better than Haskell: $ ghci let (ﬁne, fine) = (1,2) Prelude (ﬁne, fine) (1,2) In case it's not apparent, the fi in the first fine is a ligature. Python just barfs: Not Python 3: Python 3.3.2+ (default, Feb 28 2014, 00:52:16) [GCC 4.8.1] on linux Type "help", "copyright", "credits" or "license" for more information. (ﬁne, fine) = (1,2) (ﬁne, fine) (2, 2) No copy-and-paste errors involved: eval("\ufb01ne") 2 eval(b"fine".decode("ascii")) 2 Aah! Thanks Peter (and Ned and Michael) — 2-3 confusion — my bad. I am confused about the tone however: You think this (ﬁne, fine) = (1,2) # and no issue about it is fine? -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Sat, May 3, 2014 at 10:58 AM, Rustom Mody rustompm...@gmail.com wrote: You think this (ﬁne, fine) = (1,2) # and no issue about it is fine? Not sure which part you're objecting to. Are you saying that this should be an error: a, a = 1, 2 # simple ASCII identifier used twice or that Python should take the exact sequence of codepoints, rather than normalizing?

Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ﬁne = 1
>>> vars()
{'__package__': None, '__spec__': None, '__doc__': None, 'fine': 1, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__builtins__': <module 'builtins' (built-in)>, '__name__': '__main__'}

As regards normalization, I would be happy with either "keep it exactly as you provided" or "normalize according to <insert Unicode standard normalization here>", as long as it's consistent. It's like what happens with SQL identifiers: according to the standard, an unquoted name should be uppercased, but some databases instead lowercase them. It doesn't break code (modulo quoted names, not applicable here), as long as it's consistent. (My reading of PEP 3131 is that NFKC is used; is that what's implemented, or was that a temporary measure and/or something for Py2 to consider?) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 5/2/14 8:58 PM, Rustom Mody wrote: On Friday, May 2, 2014 11:37:02 PM UTC+5:30, Peter Otten wrote: Rustom Mody wrote: Just noticed a small thing in which Python does a bit better than Haskell: $ ghci let (ﬁne, fine) = (1,2) Prelude (ﬁne, fine) (1,2) In case it's not apparent, the fi in the first fine is a ligature. Python just barfs: Not Python 3: Python 3.3.2+ (default, Feb 28 2014, 00:52:16) [GCC 4.8.1] on linux Type "help", "copyright", "credits" or "license" for more information. (ﬁne, fine) = (1,2) (ﬁne, fine) (2, 2) No copy-and-paste errors involved: eval("\ufb01ne") 2 eval(b"fine".decode("ascii")) 2 Aah! Thanks Peter (and Ned and Michael) — 2-3 confusion — my bad. I am confused about the tone however: You think this (ﬁne, fine) = (1,2) # and no issue about it is fine? Can you be more explicit? It seems like you think it isn't fine. Why not? What bothers you about it? Should there be an issue? -- Ned Batchelder, http://nedbatchelder.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Saturday, May 3, 2014 6:48:21 AM UTC+5:30, Ned Batchelder wrote: On 5/2/14 8:58 PM, Rustom Mody wrote: On Friday, May 2, 2014 11:37:02 PM UTC+5:30, Peter Otten wrote: Rustom Mody wrote: Just noticed a small thing in which Python does a bit better than Haskell: $ ghci let (ﬁne, fine) = (1,2) Prelude (ﬁne, fine) (1,2) In case it's not apparent, the fi in the first fine is a ligature. Python just barfs: Not Python 3: Python 3.3.2+ (default, Feb 28 2014, 00:52:16) [GCC 4.8.1] on linux Type "help", "copyright", "credits" or "license" for more information. (ﬁne, fine) = (1,2) (ﬁne, fine) (2, 2) No copy-and-paste errors involved: eval("\ufb01ne") 2 eval(b"fine".decode("ascii")) 2 Aah! Thanks Peter (and Ned and Michael) — 2-3 confusion — my bad. I am confused about the tone however: You think this (ﬁne, fine) = (1,2) # and no issue about it is fine? Can you be more explicit? It seems like you think it isn't fine. Why not? What bothers you about it? Should there be an issue? Two identifiers that can look the same to some programmers and not to others, and that the language treats as different, are not fine (or ﬁne) to me. Putting them together as I did is summarizing the problem. Think of them as textually widely separated, and the code (un)serendipitously 'working' (i.e. not giving NameErrors). -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Sat, May 3, 2014 at 11:42 AM, Rustom Mody rustompm...@gmail.com wrote: Two identifiers that can look the same to some programmers and not to others, and that the language treats as different, are not fine (or ﬁne) to me. The language treats them as the same, though. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Fri, 02 May 2014 17:58:51 -0700, Rustom Mody wrote: I am confused about the tone however: You think this (ﬁne, fine) = (1,2) # and no issue about it is fine? It's no worse than any other obfuscated variable name: MOOSE, MO0SE, M0OSE = 1, 2, 3 xl, x1 = 1, 2 If you know your victim is reading source code in Ariel font, rn and m are virtually indistinguishable except at very large sizes. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Saturday, May 3, 2014 7:24:08 AM UTC+5:30, Chris Angelico wrote: On Sat, May 3, 2014 at 11:42 AM, Rustom Mody wrote: Two identifiers that can look the same to some programmers and not to others, and that the language treats as different, are not fine (or ﬁne) to me. The language treats them as the same, though. Whoops! I seem to be goofing a lot today. I saw Peter's (ﬁne, fine) = (1,2) but didn't notice his next line: (ﬁne, fine) (2, 2) So then I am back to my original point: Python is giving better behavior than Haskell in this regard! [Earlier I reached this conclusion via a wrong path] -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Sat, 03 May 2014 02:02:32 +0000, Steven D'Aprano wrote: On Fri, 02 May 2014 17:58:51 -0700, Rustom Mody wrote: I am confused about the tone however: You think this (ﬁne, fine) = (1,2) # and no issue about it is fine? It's no worse than any other obfuscated variable name: MOOSE, MO0SE, M0OSE = 1, 2, 3 xl, x1 = 1, 2 If you know your victim is reading source code in Ariel font, rn and m are virtually indistinguishable except at very large sizes. Ooops! I too missed that Python normalises the name ﬁne to fine, so in fact this is not a case of obfuscation. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Sat, May 3, 2014 at 12:02 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: If you know your victim is reading source code in Ariel font, rn and m are virtually indistinguishable except at very large sizes. I kinda like the idea of naming it after a bratty teenager who rebels against her father and runs away from home, but normally the font's called Arial. :) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 5/2/2014 9:15 PM, Chris Angelico wrote: (My reading of PEP 3131 is that NFKC is used; is that what's implemented, or was that a temporary measure and/or something for Py2 to consider?) The 3.4 docs say The syntax of identifiers in Python is based on the Unicode standard annex UAX-31, with elaboration and changes as defined below; see also PEP 3131 for further details. ... All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC. Without reading UAX-31, I don't know how much was changed, but I suspect not much. In any case, the current rules are intended and very unlikely to change as that would break code going either forward or back for little purpose. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
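The NFKC rule Terry quotes from the 3.4 docs can be checked from the interpreter; a small sketch, using the single-character "fi" ligature U+FB01 as one classic case that NFKC folds back to plain ASCII:

```python
import unicodedata

# '\ufb01ne' begins with the one-character ligature 'fi' (U+FB01); in many
# fonts it renders almost exactly like 'fine'.
ligature = '\ufb01ne'
plain = 'fine'

assert ligature != plain                                  # distinct as strings
assert unicodedata.normalize('NFKC', ligature) == plain   # same after NFKC

# The parser applies the same normalization to identifiers, so both
# spellings name one and the same variable:
ns = {}
exec(ligature + ' = 42', ns)
assert ns[plain] == 42
```

So, as Chris says elsewhere in the thread, the language really does treat the two spellings as the same identifier.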
Re: Unicode 7
Le mercredi 30 avril 2014 20:48:48 UTC+2, Tim Chase a écrit : On 2014-04-30 00:06, wxjmfa...@gmail.com wrote: @ Time Chase I'm perfectly aware about what I'm doing. Apparently, you're quite adept at appending superfluous characters to sensible strings...did you benchmark your email composition, too? ;-) -tkc (aka Tim, not Time) Mea culpa, ... -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Thursday, May 1, 2014 10:30:43 AM UTC+5:30, Steven D'Aprano wrote: On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote: On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote: While I dislike feeding the troll, what I see here is: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html Also your link to Joel On Software mistakenly links to me instead of Joel. There's a missing apostrophe in Ive [sic] in Acknowledgment #2. Done, Done. I didn't notice any other typos. Thank you sir! I point out that out of the two most widespread flavours of OS today, Linux/Unix and Windows, it is *Windows* and not Unix which still regularly uses legacy encodings. Not sure what you are suggesting... That (I am suggesting that) 8859 is legacy and 1252 is not? I disagree with much of your characterisation of the Unix assumption, I'd be interested to know the details -- Contents? Details? Tone? Tenor? Blaspheming the sacred scripture? (if you are so inclined of course) -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 5/1/2014 2:04 PM, Rustom Mody wrote: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I will not comment on the Unix-assumption part, but I think you go wrong with this: Unicode is a Headache. The major headache is that unicode and its very few encodings are not universally used. The headache is all the non-unicode legacy encodings still being used. So you better title this section 'Non-Unicode is a Headache'. The first sentence is this misleading tautology: With ASCII, data is ASCII whether its file, core, terminal, or network; ie ABC is 65,66,67. Let me translate: If all text is ASCII encoded, then text data is ASCII, whether ... But it was never the case that all text was ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe still uses the latter. Other mainframe makers used other encodings of A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never universal. You could have just as well said With EBCDIC, data is EBCDIC, whether ... https://en.wikipedia.org/wiki/Ascii https://en.wikipedia.org/wiki/EBCDIC A crucial step in the spread of Ascii was its use for microcomputers, including the IBM PC. The latter was considered a toy by the mainframe guys. If they had known that PCs would partly take over the computing world, they might have suggested or insisted that it use EBCDIC. With unicode there are: encodings where 'encodings' is linked to https://en.wikipedia.org/wiki/Character_encodings_in_HTML If html 'always' used utf-8 (like xml), as has become common but not universal, all of the problems with *non-unicode* character sets and encodings would disappear. The pre-unicode declarations could then disappear. More truthful: without unicode there are 100s of encodings and with unicode only 3 that we should worry about. 
in-memory formats These are not the concern of the using programmer as long as they do not introduce bugs or limitations (as do all the languages stuck on UCS-2 and many using UTF-16, including old Python narrow builds). Using what should generally be the universal transmission format, UTF-8, as the internal format means either losing indexing and slicing, having those operations slow from O(1) to O(len(string)), or adding an index table that is not part of the unicode standard. Using UTF-32 avoids the above but usually wastes space -- up to 75%. strange beasties like python's FSR Have you really let yourself be poisoned by JMF's bizarre rants? The FSR is an *internal optimization* that benefits most unicode operations that people actually perform. It uses UTF-32 by default but adapts to the strings users create by compressing the internal format. The compression is trivial -- simply dropping leading null bytes common to all characters -- so each character is still readable as is. The string header records how many bytes are left. Is the idea of algorithms that adapt to inputs really strange to you? Like good adaptive algorithms, the FSR is invisible to the user except for reducing space or time or maybe both. Unicode operations are otherwise the same as with previous wide builds. People who used to use narrow-builds also benefit from bug elimination. The only 'headaches' involved might have been those of the developers who optimized previous wide builds. CPython has many other functions with special-case optimizations and 'fast paths' for common, simple cases. For instance, (some? all?) number operations are optimized for pairs of integers. Do you call these 'strange beasties'? PyPy is faster than CPython, when it is, because it is even more adaptable to particular computations by creating new fast paths. The mechanism to create these 'strange beasties' might have been a headache for the writers, but when it works, which it now seems to, it is not for the users. 
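Terry's description of the adaptive storage can be observed with sys.getsizeof on CPython 3.3+; a rough sketch (object header sizes vary between builds, so comparing two lengths of the same string cancels the header and leaves only the per-character cost):

```python
import sys

def bytes_per_char(ch):
    # Difference between a 2000-char and a 1000-char string of the same
    # character: the fixed object header cancels out, leaving the
    # per-character storage cost chosen by the FSR (PEP 393).
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

assert bytes_per_char('a') == 1           # Latin-1 range: 1 byte per char
assert bytes_per_char('\u0fce') == 2      # rest of the BMP: 2 bytes per char
assert bytes_per_char('\U0001f600') == 4  # astral plane: 4 bytes per char
```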
-- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-05-01 23:38, Terry Reedy wrote: On 5/1/2014 2:04 PM, Rustom Mody wrote: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I will not comment on the Unix-assumption part, but I think you go wrong with this: Unicode is a Headache. The major headache is that unicode and its very few encodings are not universally used. The headache is all the non-unicode legacy encodings still being used. So you better title this section 'Non-Unicode is a Headache'. [snip] I think he's right when he says Unicode is a headache, but only because it's being used to handle languages which are, themselves, a headache: left-to-right versus right-to-left, sometimes on the same line; diacritics, possibly several on a glyph; etc. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 5:03:21 AM UTC+5:30, MRAB wrote: On 2014-05-01 23:38, Terry Reedy wrote: On 5/1/2014 2:04 PM, Rustom Mody wrote: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I will not comment on the Unix-assumption part, but I think you go wrong with this: Unicode is a Headache. The major headache is that unicode and its very few encodings are not universally used. The headache is all the non-unicode legacy encodings still being used. So you better title this section 'Non-Unicode is a Headache'. [snip] I think he's right when he says Unicode is a headache, but only because it's being used to handle languages which are, themselves, a headache: left-to-right versus right-to-left, sometimes on the same line; diacritics, possibly several on a glyph; etc. Yes, the headaches go a little further back than Unicode. There is a certain large old book... In which is described the building of a 'tower that reached up to heaven'... At which point 'it was decided'¶ to do something to prevent that. And our headaches started. I don't know how one causally connects the 'headaches' but I've seen
- mojibake
- unicode 'number-boxes' (what are these called?)
- worst of all, what we *don't* see -- how many others don't see what we see?
I never knew of any of this in the good ol' days of ASCII ¶ Passive voice is often the best choice in the interests of political correctness It would be a pleasant surprise if everyone sees a pilcrow at start of line above -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 4:08:35 AM UTC+5:30, Terry Reedy wrote: On 5/1/2014 2:04 PM, Rustom Mody wrote: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I will not comment on the Unix-assumption part, but I think you go wrong with this: Unicode is a Headache. The major headache is that unicode and its very few encodings are not universally used. The headache is all the non-unicode legacy encodings still being used. So you better title this section 'Non-Unicode is a Headache'. The first sentence is this misleading tautology: With ASCII, data is ASCII whether its file, core, terminal, or network; ie ABC is 65,66,67. Let me translate: If all text is ASCII encoded, then text data is ASCII, whether ... But it was never the case that all text was ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe still uses the latter. Other mainframe makers used other encodings of A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never universal. You could have just as well said With EBCDIC, data is EBCDIC, whether ... https://en.wikipedia.org/wiki/Ascii https://en.wikipedia.org/wiki/EBCDIC A crucial step in the spread of Ascii was its use for microcomputers, including the IBM PC. The latter was considered a toy by the mainframe guys. If they had known that PCs would partly take over the computing world, they might have suggested or insisted that it use EBCDIC. With unicode there are: encodings where 'encodings' is linked to https://en.wikipedia.org/wiki/Character_encodings_in_HTML If html 'always' used utf-8 (like xml), as has become common but not universal, all of the problems with *non-unicode* character sets and encodings would disappear. The pre-unicode declarations could then disappear. More truthful: without unicode there are 100s of encodings and with unicode only 3 that we should worry about. 
in-memory formats These are not the concern of the using programmer as long as they do not introduce bugs or limitations (as do all the languages stuck on UCS-2 and many using UTF-16, including old Python narrow builds). Using what should generally be the universal transmission format, UTF-8, as the internal format means either losing indexing and slicing, having those operations slow from O(1) to O(len(string)), or adding an index table that is not part of the unicode standard. Using UTF-32 avoids the above but usually wastes space -- up to 75%. strange beasties like python's FSR Have you really let yourself be poisoned by JMF's bizarre rants? The FSR is an *internal optimization* that benefits most unicode operations that people actually perform. It uses UTF-32 by default but adapts to the strings users create by compressing the internal format. The compression is trivial -- simply dropping leading null bytes common to all characters -- so each character is still readable as is. The string header records how many bytes are left. Is the idea of algorithms that adapt to inputs really strange to you? Like good adaptive algorithms, the FSR is invisible to the user except for reducing space or time or maybe both. Unicode operations are otherwise the same as with previous wide builds. People who used to use narrow-builds also benefit from bug elimination. The only 'headaches' involved might have been those of the developers who optimized previous wide builds. CPython has many other functions with special-case optimizations and 'fast paths' for common, simple cases. For instance, (some? all?) number operations are optimized for pairs of integers. Do you call these 'strange beasties'? Here is an instance of someone who would like a certain optimization to be dis-able-able https://mail.python.org/pipermail/python-list/2014-February/667169.html To the best of my knowledge it's nothing to do with unicode or with jmf. 
Why, if optimizations are always desirable, do C compilers have -O0, -O1, -O2, -O3 and zillions of more specific flags? JFTR I have no issue with FSR. What we have to hand to jmf - willingly or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them] I don't even know whether jmf has a real technical (as he calls it 'mathematical') issue or it's entirely political: Why should I pay more for a EURO sign than a $ sign? Well perhaps that is more related to the exchange rate than to python! -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
Rustom Mody rustompm...@gmail.com writes: Yes, the headaches go a little further back than Unicode. Okay, so can you change your article to reflect the fact that the headaches both pre-date Unicode, and are made much easier by Unicode? There is a certain large old book... Ah yes, the neo-Sumerian story “Enmerkar_and_the_Lord_of_Aratta” URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta. Probably inspired by stories older than that, of course. In which is described the building of a 'tower that reached up to heaven'... At which point 'it was decided'¶ to do something to prevent that. And our headaches started. And other myths with fantastic reasons for the diversity of language URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language. I never knew of any of this in the good ol days of ASCII Yes, by ignoring all other writing systems except one's own – and thereby excluding most of the world's people – the system can be made simpler. Hopefully the proportion of programmers who still feel they can make such a parochial choice is rapidly shrinking. -- \ “Why doesn't Python warn that it's not 100% perfect? Are people | `\ just supposed to “know” this, magically?” —Mitya Sirenef, | _o__) comp.lang.python, 2012-12-27 | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 7:59:55 AM UTC+5:30, Rustom Mody wrote: Why should I pay more for a EURO sign than a $ sign? A unicode 'headache' there: I typed the Euro sign (trying again € ) not EURO. Somebody -- I guess it's GG in overhelpful mode -- converted it And made my post: Content-Type: text/plain; charset=ISO-8859-1 Will some Devanagari vowels help it stop being helpful? अ आ इ ई उ ऊ ए ऐ -- https://mail.python.org/mailman/listinfo/python-list
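The conversion Rustom describes is unavoidable once a gateway picks ISO-8859-1, because the euro sign has no slot in that charset; a quick sketch of the difference from windows-1252 (which is often what 'latin-1' labels really mean in the wild):

```python
# U+20AC EURO SIGN is absent from ISO-8859-1, so a strict encode must fail;
# windows-1252 places it at byte 0x80.
try:
    '\u20ac'.encode('iso-8859-1')
    euro_in_8859_1 = True
except UnicodeEncodeError:
    euro_in_8859_1 = False

assert not euro_in_8859_1
assert '\u20ac'.encode('cp1252') == b'\x80'
assert b'\x80'.decode('cp1252') == '\u20ac'
```

A mailer that insists on charset=ISO-8859-1 therefore has to drop or substitute the character, which is presumably what happened to the post above.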
Re: Unicode 7
On Friday, May 2, 2014 8:09:44 AM UTC+5:30, Ben Finney wrote: Rustom Mody writes: Yes, the headaches go a little further back than Unicode. Okay, so can you change your article to reflect the fact that the headaches both pre-date Unicode, and are made much easier by Unicode? Predate: Yes Made easier: No There is a certain large old book... Ah yes, the neo-Sumerian story Enmerkar_and_the_Lord_of_Aratta URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta. Probably inspired by stories older than that, of course. Thanks for that link In which is described the building of a 'tower that reached up to heaven'... At which point 'it was decided'¶ to do something to prevent that. And our headaches started. And other myths with fantastic reasons for the diversity of language URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language. This one takes the cake - see 1st para http://hilgart.org/enformy/BronsonRekindling.pdf I never knew of any of this in the good ol days of ASCII Yes, by ignoring all other writing systems except one's own - and thereby excluding most of the world's people - the system can be made simpler. Hopefully the proportion of programmers who still feel they can make such a parochial choice is rapidly shrinking. See link above: Ethnic differences and chauvinism are invariably linked -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Fri, May 2, 2014 at 12:29 PM, Rustom Mody rustompm...@gmail.com wrote: Here is an instance of someone who would like a certain optimization to be dis-able-able https://mail.python.org/pipermail/python-list/2014-February/667169.html To the best of my knowledge its nothing to do with unicode or with jmf. It doesn't, and it has only to do with testing. I've had similar issues at times; for instance, trying to benchmark one language or language construct against another often means fighting against an optimizer. (How, for instance, do you figure out what loop overhead is, when an empty loop is completely optimized out?) This is nothing whatsoever to do with Unicode, nor to do with the optimization that Python and Pike (and maybe other languages) do with the storage of Unicode strings. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Thu, 01 May 2014 18:38:35 -0400, Terry Reedy wrote: strange beasties like python's FSR Have you really let yourself be poisoned by JMF's bizarre rants? The FSR is an *internal optimization* that benefits most unicode operations that people actually perform. It uses UTF-32 by default but adapts to the strings users create by compressing the internal format. The compression is trivial -- simply dropping leading null bytes common to all characters -- so each character is still readable as is. For anyone who, like me, wasn't convinced that Unicode worked that way, you can see for yourself that it does. You don't need Python 3.3, any version of 3.x will work. In Python 2.7, it should work if you just change the calls from chr() to unichr():

py> for i in range(256):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:3] == b'\0\0\0'
...     assert u[3:] == c.encode('latin-1')
...
py> for i in range(256, 0xFFFF+1):
...     c = chr(i)
...     u = c.encode('utf-32-be')
...     assert u[:2] == b'\0\0'
...     assert u[2:] == c.encode('utf-16-be')
...
py>

So Terry is correct: dropping leading zeroes, and treating the remainder as either Latin-1 or UTF-16, works fine, and potentially saves a lot of memory. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
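A self-contained version of the same check (on Python 3, lone surrogates U+D800-U+DFFF refuse to encode, so the sketch below skips them):

```python
# For every BMP code point: its UTF-32-BE bytes, with the leading zero bytes
# stripped, equal its Latin-1 bytes (below U+0100) or its UTF-16-BE bytes.
for i in range(0x10000):
    if 0xD800 <= i <= 0xDFFF:
        continue                      # unpaired surrogates are not encodable
    c = chr(i)
    u = c.encode('utf-32-be')
    if i < 256:
        assert u[:3] == b'\0\0\0' and u[3:] == c.encode('latin-1')
    else:
        assert u[:2] == b'\0\0' and u[2:] == c.encode('utf-16-be')
```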
Re: Unicode 7
On Friday, May 2, 2014 8:31:56 AM UTC+5:30, Chris Angelico wrote: On Fri, May 2, 2014 at 12:29 PM, Rustom Mody wrote: Here is an instance of someone who would like a certain optimization to be dis-able-able https://mail.python.org/pipermail/python-list/2014-February/667169.html To the best of my knowledge its nothing to do with unicode or with jmf. It doesn't, and it has only to do with testing. I've had similar issues at times; for instance, trying to benchmark one language or language construct against another often means fighting against an optimizer. (How, for instance, do you figure out what loop overhead is, when an empty loop is completely optimized out?) This is nothing whatsoever to do with Unicode, nor to do with the optimization that Python and Pike (and maybe other languages) do with the storage of Unicode strings. This was said in response to Terry's CPython has many other functions with special-case optimizations and 'fast paths' for common, simple cases. For instance, (some? all?) number operations are optimized for pairs of integers. Do you call these 'strange beasties'? which evidently vanished -- optimized out :D -- in multiple levels of quoting -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 5/1/2014 7:33 PM, MRAB wrote: On 2014-05-01 23:38, Terry Reedy wrote: On 5/1/2014 2:04 PM, Rustom Mody wrote: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I will not comment on the Unix-assumption part, but I think you go wrong with this: Unicode is a Headache. The major headache is that unicode and its very few encodings are not universally used. The headache is all the non-unicode legacy encodings still being used. So you better title this section 'Non-Unicode is a Headache'. [snip] I think he's right when he says Unicode is a headache, but only because it's being used to handle languages which are, themselves, a headache: left-to-right versus right-to-left, sometimes on the same line; Handling that without unicode is even worse. diacritics, possibly several on a glyph; etc. Ditto. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Friday, May 2, 2014 9:46:36 AM UTC+5:30, Terry Reedy wrote: On 5/1/2014 7:33 PM, MRAB wrote: On 2014-05-01 23:38, Terry Reedy wrote: On 5/1/2014 2:04 PM, Rustom Mody wrote: Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I will not comment on the Unix-assumption part, but I think you go wrong with this: Unicode is a Headache. The major headache is that unicode and its very few encodings are not universally used. The headache is all the non-unicode legacy encodings still being used. So you better title this section 'Non-Unicode is a Headache'. [snip] I think he's right when he says Unicode is a headache, but only because it's being used to handle languages which are, themselves, a headache: left-to-right versus right-to-left, sometimes on the same line; Handling that without unicode is even worse. diacritics, possibly several on a glyph; etc. Ditto. What's the best cure for headache? Cut off the head. What's the best cure for Unicode? Ascii. Saying however that there is no headache in unicode does not make the headache go away: http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ No, I am not saying that the contents/style/tone are right. However people are evidently suffering the transition. Denying it is not a help. And unicode consortium's ways are not exactly helpful to its own cause: Imagine the C standard committee deciding that adding mandatory garbage collection to C is a neat idea. Unicode consortium's going from old BMP to current (6.0) SMPs to who-knows-what in the future is similar. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Fri, May 2, 2014 at 2:42 PM, Rustom Mody rustompm...@gmail.com wrote: Unicode consortium's going from old BMP to current (6.0) SMPs to who-knows-what in the future is similar. Unicode 1.0: Let's make a single universal character set that can represent all the world's scripts. We'll define 65536 codepoints to do that with. Unicode 2.0: Oh. That's not enough. Okay, let's define some more. It's not a fundamental change, nor is it unhelpful to Unicode's cause. It's simply an acknowledgement that 64K codepoints aren't enough. Yes, that gave us the mess of UTF-16 being called Unicode (if it hadn't been for Unicode 1.0, I doubt we'd now have so many languages using and exposing UTF-16 - it'd be a simple judgment call, pick UTF-8/UTF-16/UTF-32 based on what you expect your users to want to use), but it doesn't change Unicode's goal, and it also doesn't indicate that there's likely to be any more such changes in the future. (Just look at how little of the Unicode space is allocated so far.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
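Chris's point about how little of the Unicode space is allocated can be eyeballed from the interpreter itself; a rough sketch (the exact count depends on the Unicode database bundled with the running Python, so only a loose bound is asserted):

```python
import sys
import unicodedata

# 17 planes of 65,536 code points each since Unicode 2.0.
assert sys.maxunicode == 0x10FFFF

# Count code points that carry a character name in this interpreter's copy
# of the Unicode character database.
named = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.name(chr(cp), None) is not None
)

# Well under half of the 1,114,112 possible code points are named characters.
assert 0 < named < (sys.maxunicode + 1) // 2
```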
Re: Unicode 7
On 5/1/2014 10:29 PM, Rustom Mody wrote: Here is an instance of someone who would like a certain optimization to be dis-able-able https://mail.python.org/pipermail/python-list/2014-February/667169.html To the best of my knowledge its nothing to do with unicode or with jmf. Right. Ned has an actual technical reason to complain, even though the developers do not consider it strong enough to act. Why if optimizations are always desirable do C compilers have: -O0 O1 O2 O3 and zillions of more specific flags? One reason is that many optimizations sometimes introduce bugs, or to put it another way, they are based on assumptions that are not true for all code. For instance, some people have suggested that CPython should have an optional optimization based on the assumption that builtin names are never rebound. That is true for perhaps many code files, but definitely not all. Guido does not seem to like such conditional optimizations. I can think of three reasons for not adding to the numerous options CPython already has.
1. We do not have the developer resources to handle the added complications of multiple optimization options.
2. Zillions of options and flags confuse users. As it is, most options are seldom used.
3. Optimization options are easily misused, possibly leading to silently buggy results, or mysterious failures. For instance, people sometimes rebind builtins without realizing what they have done, such as using 'id' as a parameter name. Being in the habit of routinely using the 'assume no rebinding option' would lead to problems.
I am rather sure that the string (unicode) test suite was reviewed and the performance of 3.2 wide builds recorded before the new implementation was committed. The tracker currently has 37 behavior (bug) issues marked for the unicode component. In a quick review, I do not see that any have anything to do with using standard UTF-32 versus adaptive UTF-32. 
Indeed, I believe a majority of the 37 were filed before 3.3 or are 2.7 specific. Problems with FSR itself have been fixed as discovered. JFTR I have no issue with FSR. What we have to hand to jmf - willingly or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them] Somewhat ironically, I suppose you are right. I dont even know whether jmf has a real technical (as he calls it 'mathematical') issue or its entirely political: I would call his view personal or philosophical. I only object to endless repetition and the deception of claiming that personal views are mathematical facts. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
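Terry's 'id' example is easy to reproduce; a small sketch of why an optimizer that assumed builtin names are never rebound would be unsafe:

```python
# The parameter silently shadows the builtin id() inside the function, so
# the name 'id' does not always mean the builtin -- perfectly legal code
# would break under an 'assume no rebinding' optimization.
def register(id):
    return id * 2      # 'id' here is the argument, not the builtin

assert register(21) == 42
assert callable(id)    # the builtin is untouched at module level
```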
Re: Unicode 7
@ Time Chase I'm perfectly aware about what I'm doing. @ MRAB ...Although the third example is the fastest, it's also the wrong way to handle Unicode: ... Maybe that's exactly the opposite. It illustrates very well, the quality of coding schemes endorsed by Unicode.org. I deliberately choose utf-8.

>>> sys.getsizeof('\u0fce')
40
>>> sys.getsizeof('\u0fce'.encode('utf-8'))
20
>>> sys.getsizeof('\u0fce'.encode('utf-16-be'))
19
>>> sys.getsizeof('\u0fce'.encode('utf-32-be'))
21

Q. How to save memory without wasting time in encoding? By using products using natively the unicode coding schemes? Are you understanding unicode? Or are you understanding unicode via Python? --- A Tibetan monk [*] using Py32:

>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[2.3394840182882186, 2.3145832750782653, 2.3207231951529685]
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[2.328517624800078, 2.3169403900011076, 2.317586282812048]

[*] Your curiosity has certainly shown, what this code point means. For the others: U+0FCE TIBETAN SIGN RDEL NAG RDEL DKAR signifies good luck earlier, bad luck later (My comment: Good luck with Python or bad luck with Python) jmf -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-04-30 00:06, wxjmfa...@gmail.com wrote: @ Time Chase I'm perfectly aware about what I'm doing. Apparently, you're quite adept at appending superfluous characters to sensible strings...did you benchmark your email composition, too? ;-) -tkc (aka Tim, not Time) -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On Tue, 29 Apr 2014 21:53:22 -0700, Rustom Mody wrote: On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote: While I dislike feeding the troll, what I see here is: snipped Since its Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html I disagree with much of your characterisation of the Unix assumption, and I point out that out of the two most widespread flavours of OS today, Linux/Unix and Windows, it is *Windows* and not Unix which still regularly uses legacy encodings. Also your link to Joel On Software mistakenly links to me instead of Joel. There's a missing apostrophe in Ive [sic] in Acknowledgment #2. I didn't notice any other typos. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-04-29 10:37, wxjmfa...@gmail.com wrote:

>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]
>>> # more interesting
>>> timeit.repeat("(x*1000 + y)[:-1]",
...     setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]

While I dislike feeding the troll, what I see here is: on your machine, all unicode manipulations in the test should take ~5.4 seconds. But Python notices that some of your strings *don't* require a full 32-bits and thus optimizes those operations, cutting about 75% of the processing time (wow...4-bytes-per-char to 1-byte-per-char, I wonder where that 75% savings comes from). So rather than highlight any *problem* with Python, your [mostly worthless microbenchmark non-realworld] tests show that Python's unicode implementation is awesome. Still waiting to see an actual bug-report as mentioned on the other thread. -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode 7
On 2014-04-29 18:37, wxjmfa...@gmail.com wrote: Let see how Python is ready for the next Unicode version (Unicode 7.0.0.Beta).

>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
>>> timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]
>>> # more interesting
>>> timeit.repeat("(x*1000 + y)[:-1]",
...     setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]

Although the third example is the fastest, it's also the wrong way to handle Unicode:

>>> x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')
>>> t = (x*1000 + y)[:-1].decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 3000-3001: unexpected end of data

Note 1: lookup is not the problem. Note 2: From Unicode.org : [...] We strongly encourage [...] and test them with their programs [...] - Done. jmf -- https://mail.python.org/mailman/listinfo/python-list
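The traceback above comes from slicing *encoded bytes* rather than text; a minimal reconstruction ('\u0fce' occupies three bytes in UTF-8, so dropping the last byte truncates it mid-character):

```python
x = 'abc'.encode('utf-8')
y = '\u0fce'.encode('utf-8')
assert y == b'\xe0\xbf\x8e'            # three bytes for one code point

broken = (x * 1000 + y)[:-1]           # chops one byte off the last character
try:
    broken.decode('utf-8')
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
assert reason == 'unexpected end of data'

# Slicing the text first, then encoding, can never split a character:
ok = ('abc' * 1000 + '\u0fce')[:-1]
assert ok.encode('utf-8').decode('utf-8') == ok
```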
Re: Unicode 7
On Tuesday, April 29, 2014 11:29:23 PM UTC+5:30, Tim Chase wrote: While I dislike feeding the troll, what I see here is: snipped Since it's Unicode-troll time, here's my contribution http://blog.languager.org/2014/04/unicode-and-unix-assumption.html :-) More seriously, since I've quoted some esteemed members of this list explicitly (Steven) and the list in general, please let me know if something is inaccurate or inappropriate -- https://mail.python.org/mailman/listinfo/python-list