Re: Newbie question about text encoding
On Monday, March 9, 2015 at 12:05:05 PM UTC+5:30, Steven D'Aprano wrote: Chris Angelico wrote: As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that all codepoints be valid (no surrogates, no U+FFFE, etc)? U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66 noncharacters in Unicode, and they are legal in strings. Interesting -- Thanks! I wonder whether that's one more instance of the anti-pattern (other thread)? Number that's not a number -- NaN Pointer that points nowhere -- NULL SQL data that's not there but there -- null http://www.unicode.org/faq/private_use.html#nonchar8 I think the only illegal code points are surrogates. Surrogates should only appear as bytes in UTF-16 byte-strings. Even more interesting: So there's a whole hierarchy of illegality?? Could you suggest some good reference for 'surrogate'? -- https://mail.python.org/mailman/listinfo/python-list
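To make the distinction concrete, a small sketch in a CPython 3.4-era interactive session (exact error wording may differ between versions): the noncharacter U+FFFF passes through the UTF-8 codec, while a lone surrogate does not.
>>> '\uffff'.encode('utf-8')   # noncharacter: legal in a str, encodes fine
b'\xef\xbf\xbf'
>>> '\udd00'.encode('utf-8')   # lone surrogate: rejected by the codec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 0: surrogates not allowed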
Re: Newbie question about text encoding
Ben Finney ben+pyt...@benfinney.id.au: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Then we're back to square one: b'\x80'.decode('utf-8', errors='surrogateescape') '\udc80' Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Chris Angelico wrote: As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that all codepoints be valid (no surrogates, no U+FFFE, etc)? U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66 noncharacters in Unicode, and they are legal in strings. http://www.unicode.org/faq/private_use.html#nonchar8 I think the only illegal code points are surrogates. Surrogates should only appear as bytes in UTF-16 byte-strings. U+FFFE would cause problems at the beginning of a UTF-16 stream, as it could be mistaken for a BOM - that's why it's a noncharacter. But sure, let's leave them out of the discussion. The question is whether surrogates are legal or not. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Chris Angelico wrote: As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that all codepoints be valid (no surrogates, no U+FFFE, etc)? U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66 noncharacters in Unicode, and they are legal in strings. http://www.unicode.org/faq/private_use.html#nonchar8 I think the only illegal code points are surrogates. Surrogates should only appear as bytes in UTF-16 byte-strings. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. But it is a valid non-character code point. It is quite correct to throw this error. '\udd00' is a valid str object: Is it though? Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical question. It's a genuine question. Ah, I see the confusion. Yes, it is plausible to permit the UTF-8-like encoding of surrogates; but it's illegal according to the RFC: https://tools.ietf.org/html/rfc3629 The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. They're not valid characters, and the UTF-8 spec explicitly says that they must not be encoded. Python is fully spec-compliant in rejecting these. Some encoders [1] will permit them, but the resulting stream is invalid UTF-8, just as CESU-8 and Modified UTF-8 are (the latter being UTF-8, only U+0000 is represented as C0 80). ChrisA [1] eg http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/string_to_utf8.html optionally -- https://mail.python.org/mailman/listinfo/python-list
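CPython exposes both sides of this: the strict codec follows RFC 3629, while the optional 'surrogatepass' error handler deliberately produces the non-standard bytes. A sketch (byte values as produced by CPython 3.4; the handler is an explicit opt-in, not the default):
>>> '\udd00'.encode('utf-8')                   # spec-compliant behaviour: refused
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 0: surrogates not allowed
>>> '\udd00'.encode('utf-8', 'surrogatepass')  # non-standard variant, by request only
b'\xed\xb4\x80'
>>> b'\xed\xb4\x80'.decode('utf-8', 'surrogatepass')
'\udd00'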
Re: Newbie question about text encoding
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical question. It's a genuine question. As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that all codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind of thing that's usually done in an obscure language before it hits a mainstream one. Pike is similar to Python here. I can create a string with invalid code points in it:
"\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"
but I can't UTF-8 encode that:
string_to_utf8("\uFFFE\uDD00");
Character 0xdd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8(\ufffe\udd00)
HilfeInput:1: HilfeInput()->___HilfeWrapper()
Or, using the streaming UTF-8 encoder instead of the short-hand:
Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding \ufffe[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1: _Charset.UTF8enc()->feed(\ufffe\udd00)
HilfeInput:1: HilfeInput()->___HilfeWrapper()
Does anyone know of a language where you can't even construct the string? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
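Python behaves like Pike on the construction side: building the string succeeds, and only the encode step objects. A sketch:
>>> s = '\ufffe' + chr(0xDD00)   # construction is allowed
>>> len(s), [hex(ord(c)) for c in s]
(2, ['0xfffe', '0xdd00'])
>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 1: surrogates not allowed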
Re: Newbie question about text encoding
Rustom Mody wrote: On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? Literal/Legalistic answer: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135 Nice one :-) but not exactly in the spirit of what we're discussing (as you acknowledge below), so I won't discuss that. [And already quoted at http://blog.languager.org/2015/03/whimsical-unicode.html ] An answer more in the spirit of what I am trying to say: Idle3, Roy's example and in general all systems that are python-centric but use components outside of python that are unicode-broken IOW I would expect people (at least people with good faith) reading my bug-prone-system code...seemingly working code such as python 3... to interpret that NOT as python 3 is seemingly working but actually broken Why not? That is the natural interpretation of the sentence, particularly in the context of your previous sentence: [quote] Or you can skip the blame-game and simply note the fact that large segments of extant code-bases are currently in bug-prone or plain buggy state. This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. [end quote] The natural interpretation of this is that Python 3 is only *seemingly* working, but is also an example of a code base in bug-prone or plain buggy state. If that's not your intended meaning, then rather than casting aspersions on my honesty (good faith indeed) you might accept that perhaps you didn't quite manage to get your message across. But as Apps made with working system code (eg python3) can end up being broken because of other non-working system code - eg mysql, java, javascript, windows-shell, and ultimately windows, linux Don't forget viruses or other malware, cosmic rays, processor bugs, dry solder joints on the motherboard, faulty memory, and user-error. I'm not sure what point you think you are making. If you want to discuss the fact that complex systems have more interactions than simple systems, and therefore more ways for things to go wrong, I will agree. I'll agree that this is an issue with Python code that interacts with other systems which may or may not implement Unicode correctly. There are a few ways to interpret this: (1) You're making a general point about the complexity of modern computing. (2) You're making the point that dealing with text encodings in general, and Unicode in specific, is hard because of the interaction of programming language, database, file system, locale, etc. (3) You're implying that Python ought to fix this problem some how. (4) You're implying that *Unicode* specifically is uniquely problematic in this way. Or at least *unusual* to be problematic in this way. I will agree with 1 and 2; I'll say that 3 would be nice but in the absence of concrete proposals for how to fix it, it's just meaningless chatter. And I'll disagree strongly with 4. Unicode came into existence because legacy encodings suffer from similar problems, only worse. (One major advantage of Unicode over previous multi-byte encodings is that the UTF encodings are self-healing. A single corrupted byte will, *at worst*, cause a single corrupted code point.) 
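The self-healing property is easy to check in a Python session; this sketch clobbers one byte of a two-byte UTF-8 sequence and decodes with errors='replace', and only a single code point is lost:
>>> good = 'abc\u00e9def'.encode('utf-8')   # b'abc\xc3\xa9def'
>>> bad = good[:4] + b'X' + good[5:]        # corrupt the continuation byte
>>> bad.decode('utf-8', errors='replace')
'abc\ufffdXdef'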
In one sense, Unicode has solved these legacy encoding problems, in the sense that if you always use a correct implementation of Unicode then you won't *ever* suffer from problems like moji-bake, broken strings and so forth. In another sense, Unicode hasn't solved these legacy problems because we still have to deal with files using legacy encodings, as well as standards organisations, operating systems, developers, applications and users who continue to produce new content using legacy encodings, buggy or incorrect implementations of the standard, also viruses, cosmic rays, dry solder joints and user-error. How are these things Unicode's fault or responsibility? -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. But it is a valid non-character code point. It is quite correct to throw this error. '\udd00' is a valid str object: Is it though? Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical question. It's a genuine question. '\udd00' '\udd00' '\udd00'.encode('utf-32') b'\xff\xfe\x00\x00\x00\xdd\x00\x00' '\udd00'.encode('utf-16') b'\xff\xfe\x00\xdd' If you explicitly specify the endianness (say, utf-16-be or -le) then you don't get the BOMs. I was simply stating that UTF-8 is not a bijection between unicode strings and octet strings (even forgetting Python). Enriching Unicode with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not without side effects. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
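The BOM behaviour itself is easy to see with an ordinary character; a sketch (the plain 'utf-16' codec writes a BOM in native byte order, here a little-endian machine):
>>> 'A'.encode('utf-16')      # BOM prepended
b'\xff\xfeA\x00'
>>> 'A'.encode('utf-16-le')   # explicit endianness, no BOM
b'A\x00'
>>> 'A'.encode('utf-16-be')
b'\x00A'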
Re: Newbie question about text encoding
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: '\udd00' is a valid str object: Is it though? Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical question. It's a genuine question. The problem is that no matter how you shuffle surrogates, encoding schemes, code points and the like, a wrinkle always remains. I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But that's where the buck stops; traditional arithmetic functions are closed under ℂ. Unicode apparently hasn't found a similar closure. That's why I think that while UTF-8 is a fabulous way to bring Unicode to Linux, Linux should have taken the tack that Unicode is always an application-level interpretation with few operating system tie-ins. Unfortunately, the GNU world is busy trying to build a Unicode frosting everywhere. The illusion can never be complete but is convincing enough for application developers to forget to handle corner cases. To answer your question, I think every code point from 0 to 1114111 should be treated as valid and analogous. Thus Python is correct here:
>>> len('\udd00')
1
>>> len('\ufeff')
1
The alternatives are far too messy to consider. Marko -- https://mail.python.org/mailman/listinfo/python-list
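That 0..1114111 range is exactly what Python's chr() enforces; a sketch:
>>> chr(1114111)   # U+10FFFF, the last code point
'\U0010ffff'
>>> chr(1114112)   # one past the end
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)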
Re: Newbie question about text encoding
Steven D'Aprano wrote: Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Are you talking about the fact that not all byte streams are valid UTF-8? That is, some byte objects b may raise an exception on b.decode('utf-8'). Eh, I should have read the rest of the thread before replying... I don't see why that means UTF-8 suffers badly from this. Can you give an example of where you would expect to take an arbitrary byte-stream, decode it as UTF-8, and expect the results to be meaningful? File names on Unix-like systems. Unfortunately file names are a bit of a mess, but we're slowly converging on Unicode support for files. I reckon that by 2070, 2080 tops, we'll have that licked... The three major operating systems have different levels of support for Unicode file names:
* Apple OS X: HFS+ stores file names in decomposed form, using UTF-16. I think this is the strictest Unicode support of all common file systems. Well done Apple. Decomposed in this sense means that single code points may be expanded where possible, e.g. é U+00E9 LATIN SMALL LETTER E WITH ACUTE will be stored as two code points, U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT (illustrated below).
* Windows: NTFS stores file names as sequences of 16-bit code units except 0x0000. (Additional restrictions also apply: e.g. in POSIX mode, / is also forbidden; in Win32 mode, / ? + etc. are forbidden.) The code units are interpreted as UTF-16 but the file system doesn't prevent you from creating file names with invalid sequences.
* Linux: ext2/ext3 stores file names as arbitrary bytes except for / and NUL. However most Linux distributions treat file names as if they were UTF-8 (displaying ? glyphs for undecodable bytes), and many Linux GUI file managers enforce the rule that file names are valid UTF-8.
File systems on removable media (FAT32, UDF, ISO-9660 with or without extensions such as Joliet and Rock Ridge) have their own issues, but generally speaking don't support Unicode well or at all. So although the current situation is still a bit of a mess, there is a slow move towards file names which are valid Unicode. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
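The decomposed form mentioned for HFS+ can be reproduced with the unicodedata module; a sketch:
>>> import unicodedata
>>> nfd = unicodedata.normalize('NFD', '\u00e9')   # é, U+00E9
>>> [hex(ord(c)) for c in nfd]
['0x65', '0x301']   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT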
Re: Newbie question about text encoding
Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. It is quite correct to throw this error. '\udd00' is a valid str object:
>>> '\udd00'
'\udd00'
>>> '\udd00'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>>> '\udd00'.encode('utf-16')
b'\xff\xfe\x00\xdd'
I was simply stating that UTF-8 is not a bijection between unicode strings and octet strings (even forgetting Python). Enriching Unicode with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not without side effects. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa ma...@pacujo.net wrote: Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. It is quite correct to throw this error. '\udd00' is a valid str object: '\udd00' '\udd00' '\udd00'.encode('utf-32') b'\xff\xfe\x00\x00\x00\xdd\x00\x00' '\udd00'.encode('utf-16') b'\xff\xfe\x00\xdd' I was simply stating that UTF-8 is not a bijection between unicode strings and octet strings (even forgetting Python). Enriching Unicode with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not without side effects. But it's not a valid Unicode string, so a Unicode encoding can't be expected to cope with it. Mathematically, 0xC0 0x80 would represent U+0000, and some UTF-8 codecs generate and accept this (in order to allow U+0000 without ever yielding 0x00), but that doesn't mean that UTF-8 should allow that byte sequence. The only reason to craft some kind of Unicode string for any arbitrary sequence of bytes is the smuggling effect used for file name handling. There is no reason to support invalid Unicode codepoints. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
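Python's decoder treats the overlong C0 80 form exactly as the standard says; a sketch:
>>> b'\x00'.decode('utf-8')       # the standard encoding of U+0000
'\x00'
>>> b'\xc0\x80'.decode('utf-8')   # the "modified UTF-8" form is rejected
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte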
Re: Newbie question about text encoding
On Monday, March 9, 2015 at 7:39:42 AM UTC+5:30, Cameron Simpson wrote: On 07Mar2015 22:09, Steven D'Aprano wrote: Rustom Mody wrote: [...big snip...] Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with SMP characters - System made with python and mysql. SMP works with python, breaks mysql. So whole system broke due to those 4 in 200,000,000 records No, they broke because MySQL has buggy Unicode handling. [...] You could also choose to do with astral crap (Roy's words) what we all do with crap -- throw it out as early as possible. And when Roy's customers demand that his product support emoji, or complain that they cannot spell their own name because of his parochial and ignorant idea of crap, perhaps he will consider doing what he should have done from the beginning: Stop using MySQL, which is a joke of a database[1], and use Postgres which does not have this problem. [1] So I have been told. I use MySQL a fair bit, and Postgres very slightly. I would agree with your characterisation above; MySQL is littered with inconsistencies and arbitrary breakage, both in tools and SQL implementation. And Postgres has been a pure pleasure to work with, little though I have done that so far. Cheers, Cameron Simpson There is no human problem which could not be solved if people would simply do as I advise. - Gore Vidal I think that last quote sums up the issue best. I've written to Intel asking them to make their next generation have 21-bit wide bytes. Once they do that we will be back in the paradise we have been for the last 40 years which I call the 'Unix-assumption' http://blog.languager.org/2014/04/unicode-and-unix-assumption.html Until then... We have to continue living in the real world. Which includes 10 times more windows than linux users. Is windows 10 times better an OS than linux? In the 'real world' people make choices for all sorts of reasons. My guess is the top reason is the pointiness of the hair of pointy-haired-boss. Just like people choose windows over linux, people choose mysql over postgres, and that's the context of this discussion -- people stuck in sub-optimal choices -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Shouldn't the error type be a ValueError, though? The statement is not, to my mind, erroneous syntax. -- \ “Please do not feed the animals. If you have any suitable food, | `\ give it to the guard on duty.” —zoo, Budapest | _o__) | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07Mar2015 22:09, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Rustom Mody wrote: [...big snip...] Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with SMP characters - System made with python and mysql. SMP works with python, breaks mysql. So whole system broke due to those 4 in 200,000,000 records No, they broke because MySQL has buggy Unicode handling. [...] You could also choose to do with astral crap (Roy's words) what we all do with crap -- throw it out as early as possible. And when Roy's customers demand that his product support emoji, or complain that they cannot spell their own name because of his parochial and ignorant idea of crap, perhaps he will consider doing what he should have done from the beginning: Stop using MySQL, which is a joke of a database[1], and use Postgres which does not have this problem. [1] So I have been told. I use MySQL a fair bit, and Postgres very slightly. I would agree with your characterisation above; MySQL is littered with inconsistencies and arbitrary breakage, both in tools and SQL implementation. And Postgres has been a pure pleasure to work with, little though I have done that so far. Cheers, Cameron Simpson c...@zip.com.au There is no human problem which could not be solved if people would simply do as I advise. - Gore Vidal -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Mon, Mar 9, 2015 at 1:09 PM, Ben Finney ben+pyt...@benfinney.id.au wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Shouldn't the error type be a ValueError, though? The statement is not, to my mind, erroneous syntax. For the string literal, I would say SyntaxError is more appropriate than ValueError, as a string object has to be constructed at compilation time. I'd still like to see a report from someone who has used a language that specifically disallows all surrogates in strings. Does it help? Is it more hassle than it's worth? Are there weird edge cases that it breaks? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015, at 22:09, Ben Finney wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Shouldn't the error type be a ValueError, though? The statement is not, to my mind, erroneous syntax. In this hypothetical, it's a problem with evaluating a literal - in the same way that '\U12345', or '\U00110000', is. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: '\udd00' is a valid str object: Is it though? Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical question. It's a genuine question. The problem is that no matter how you shuffle surrogates, encoding schemes, code points and the like, a wrinkle always remains. Really? Define your terms. Can you define wrinkles, and prove that it is impossible to remove them? What's so bad about wrinkles anyway? I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But that's where the buck stops; traditional arithmetic functions are closed under ℂ. That's simply incorrect. What's z/(0+0i)? There are many more number sets used by mathematicians, some going back to the 1800s. Here are just a few:
* ℝ-overbar or [−∞, +∞], which adds a pair of infinities to ℝ.
* ℝ-caret or ℝ+{∞}, which does the same but with a single unsigned infinity.
* A similar extended version of ℂ with a single infinity.
* Split-complex or hyperbolic numbers, defined similarly to ℂ except with i**2 = +1 (rather than the complex i**2 = -1).
* Dual numbers, which add a single infinitesimal number ε != 0 with the property that ε**2 = 0.
* Hyperreal numbers.
* John Conway's surreal numbers, which may be the largest possible set, in the sense that it can construct all finite, infinite and infinitesimal numbers. (The hyperreals and dual numbers can be considered subsets of the surreals.)
The process of extending ℝ to ℂ is formally known as Cayley–Dickson construction, and there is an infinite number of algebras (and hence number sets) which can be constructed this way. The next few are:
* Hamilton's quaternions ℍ, very useful for dealing with rotations in 3D space. They fell out of favour for some decades, but are now experiencing something of a renaissance.
* Octonions or Cayley numbers.
* Sedenions.
Unicode apparently hasn't found a similar closure. Similar in what way? And why do you think this is important? It is not a requirement for every possible byte sequence to be a valid Unicode string, any more than it is a requirement for every possible byte sequence to be valid JPG, zip archive, or ELF executable. Some byte strings simply are not JPG images, zip archives or ELF executables -- or Unicode strings. So what? Why do you think that is a problem that needs fixing by the Unicode standard? It may be a problem that needs fixing by (for example) programming languages, and Python invented the surrogateescape error handler to smuggle such invalid bytes into strings. Other solutions may exist as well. But that's not part of Unicode and it isn't a problem for Unicode. That's why I think that while UTF-8 is a fabulous way to bring Unicode to Linux, Linux should have taken the tack that Unicode is always an application-level interpretation with few operating system tie-ins. Should have? That is *exactly* the status quo, and while it was the only practical solution given Linux's history, it's a horrible idea. That Unicode is stuck on top of an OS which is unaware of Unicode is precisely why we're left with problems like "how do you represent arbitrary bytes as Unicode strings?". Unfortunately, the GNU world is busy trying to build a Unicode frosting everywhere. The illusion can never be complete but is convincing enough for application developers to forget to handle corner cases.
To answer your question, I think every code point from 0 to 1114111 should be treated as valid and analogous. Your opinion isn't very relevant. What is relevant is what the Unicode standard demands, and I think it requires that strings containing surrogates are illegal (rather like x/0 is illegal in the real numbers). Wikipedia states: The Unicode standard permanently reserves these code point values [U+D800 to U+DFFF] for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points. However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case. http://en.wikipedia.org/wiki/UTF-16 So yet again we are left with the
Re: Newbie question about text encoding
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: For those cases where you do wish to take an arbitrary byte stream and round-trip it, Python now provides an error handler for that.
py> import random
py> b = bytes([random.randint(0, 255) for _ in range(1)])
py> s = b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0: invalid start byte
py> s = b.decode('utf-8', errors='surrogateescape')
py> s.encode('utf-8', errors='surrogateescape') == b
True
That is indeed a valid workaround. With it we achieve b.decode('utf-8', errors='surrogateescape').encode('utf-8', errors='surrogateescape') == b for any bytes b. It goes to great lengths to address the Linux programmer's situation. However,
* it's not UTF-8 but a variant of it,
* it sacrifices the ordering correspondence of UTF-8:
>>> '\udc80' > 'ä'
True
>>> '\udc80'.encode('utf-8', errors='surrogateescape') > 'ä'.encode('utf-8', errors='surrogateescape')
False
* it still isn't bijective between str and bytes:
>>> '\udd00'.encode('utf-8', errors='surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 0: surrogates not allowed
Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? Literal/Legalistic answer: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135 [And already quoted at http://blog.languager.org/2015/03/whimsical-unicode.html ] An answer more in the spirit of what I am trying to say: Idle3, Roy's example and in general all systems that are python-centric but use components outside of python that are unicode-broken IOW I would expect people (at least people with good faith) reading my bug-prone-system code...seemingly working code such as python 3... to interpret that NOT as python 3 is seemingly working but actually broken But as Apps made with working system code (eg python3) can end up being broken because of other non-working system code - eg mysql, java, javascript, windows-shell, and ultimately windows, linux -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 6:20 PM, Marko Rauhamaa ma...@pacujo.net wrote: * it still isn't bijective between str and bytes: '\udd00'.encode('utf-8', errors='surrogateescape') Traceback (most recent call last): File stdin, line 1, in module UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00' in position 0: surrogates not allowed Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. It is quite correct to throw this error. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Please provide an example; that sounds like a bug. If there is any invalid UTF-8 stream which decodes without an error, it is actually a security bug, and should be fixed pronto in all affected and supported versions. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa ma...@pacujo.net wrote: Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Please provide an example; that sounds like a bug. If there is any invalid UTF-8 stream which decodes without an error, it is actually a security bug, and should be fixed pronto in all affected and supported versions. Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. That's not the same as what you said. All you've proven is that there are bit patterns which are not UTF-8 streams... which is a very deliberate feature. How does UTF-8 *suffer* from this? It benefits hugely! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07/03/2015 16:25, Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Please provide an example; that sounds like a bug. If there is any invalid UTF-8 stream which decodes without an error, it is actually a security bug, and should be fixed pronto in all affected and supported versions. Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa ma...@pacujo.net wrote: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. That's not the same as what you said. Except that it's precisely what I said. All you've proven is that there are bit patterns which are not UTF-8 streams... And that causes problems. which is a very deliberate feature. Well, nobody desired it. It was just something that had to give. I believe you *could* have defined it as a bijective mapping but then you would have lost the sorting order correspondence. How does UTF-8 *suffer* from this? It benefits hugely! You can't operate on file names and text files using Python strings. Or at least, you will need to add (nontrivial) exception catching logic. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: There are two things happening here: 1) The underlying file system is not UTF-8, and you can't depend on that, Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python itself violates that principle by having os.listdir() return str objects (to mention one example). Only because you gave it a str with the path name. If you want to refer to file names using bytes, then be consistent and refer to ALL file names using bytes. As I demonstrated, that works just fine. 2) You forgot to put the path on that, so it failed to find the file. Here's my version of your demo: open(/tmp/xyz/+os.listdir('/tmp/xyz')[0]) _io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8' Looks fine to me. I stand corrected. Then we have: os.listdir()[0].encode('utf-8') Traceback (most recent call last): File stdin, line 1, in module UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed So? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
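For what it's worth, the bridge Python provides for exactly this is os.fsencode()/os.fsdecode(), which apply the filesystem encoding with surrogateescape for you. A sketch, assuming the /tmp/xyz demo above and a UTF-8 locale:
>>> import os
>>> name = os.listdir('/tmp/xyz')[0]   # '\udc80'
>>> os.fsencode(name)                  # back to the original bytes
b'\x80'
>>> os.fsdecode(b'\x80')
'\udc80'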
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 5:34 AM, Dan Sommers d...@tombstonezero.net wrote: I think we're all agreeing: not all file systems are the same, and Python doesn't smooth out all of the bumps, even for something that seems as simple as displaying the names of files in a directory. And that's *after* we've agreed that filesystems contain files in hierarchical directories. I think you and I are in agreement. No idea about Marko, I'm still not entirely sure what he's saying. Python can't smooth out all of the bumps in file systems, any more than Unicode can smooth out the bumps in natural language, or TCP can smooth out the bumps in IP. The abstraction layers help, but every now and then they leak, and you have to cope with the underlying mess. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07/03/2015 16:48, Marko Rauhamaa wrote: Mark Lawrence breamore...@yahoo.co.uk: On 07/03/2015 16:25, Marko Rauhamaa wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. Python 3.3.2 (default, Dec 4 2014, 12:49:00) [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux Type help, copyright, credits or license for more information. b'\x80'.decode('utf-8') Traceback (most recent call last): File stdin, line 1, in module UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte Marko It would clearly help if you were to type in the correct UK English accent. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Dan Sommers d...@tombstonezero.net: I think we're all agreeing: not all file systems are the same, and Python doesn't smooth out all of the bumps, even for something that seems as simple as displaying the names of files in a directory. And that's *after* we've agreed that filesystems contain files in hierarchical directories. A whole new set of problems took root with Unicode. There were gains but there were losses, too. Python is not alone in the conceptual difficulties. Guile 2's (readdir) simply converts bad UTF-8 in a filename into a question mark:
scheme@(guile-user) [1] (readdir s)
$3 = "?"
scheme@(guile-user) [4] (equal? $3 "?")
$4 = #t
So does lxterminal:
$ ls
?
even though it's all bytes on the inside:
$ [ $(ls) = ? ]
$ echo $?
1
Scripts that make use of standard text utilities must now be very careful:
$ ls | egrep '^.$' | wc -l
0
You are well advised to sprinkle LANG=C in your scripts:
$ ls | LANG=C egrep '^.$' | wc -l
1
Nasty locale-related bugs plague installation scripts, whose writers are not accustomed to running their tests in myriads of locales. The topic is of course larger than just Unicode. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 3:40 AM, Mark Lawrence breamore...@yahoo.co.uk wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. He was talking about this line of code: b.decode('utf-8').encode('utf-8') == b With the above assignment, that does indeed throw an error - which is correct behaviour. Challenge: Figure out a byte-string input that will make this function return True. def is_utf8_broken(b): return b.decode('utf-8').encode('utf-8') != b Correct responses for this function are either False or raising an exception. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Mark Lawrence breamore...@yahoo.co.uk: It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote: See: $ mkdir /tmp/xyz $ touch /tmp/xyz/$'\x80' $ python3 Python 3.3.2 (default, Dec 4 2014, 12:49:00) [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux Type "help", "copyright", "credits" or "license" for more information. import os os.listdir('/tmp/xyz') ['\udc80'] open(os.listdir('/tmp/xyz')[0]) Traceback (most recent call last): File "<stdin>", line 1, in <module> FileNotFoundError: [Errno 2] No such file or directory: '\udc80' File names encoded with Latin-X are quite commonplace even in UTF-8 locales. That is not a problem with UTF-8, though. I don't understand how you're blaming UTF-8 for that. There are two things happening here: 1) The underlying file system is not UTF-8, and you can't depend on that, ergo the decode to Unicode has to have some special handling of failing bytes. 2) You forgot to put the path on that, so it failed to find the file. Here's my version of your demo: open("/tmp/xyz/"+os.listdir('/tmp/xyz')[0]) <_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'> Looks fine to me. Alternatively, if you pass a byte string to os.listdir, you get back a list of byte string file names: os.listdir(b"/tmp/xyz") [b'\x80'] open(b"/tmp/xyz/"+os.listdir(b'/tmp/xyz')[0]) <_io.TextIOWrapper name=b'/tmp/xyz/\x80' mode='r' encoding='UTF-8'> Either way works. You can use bytes or text, and if you use text, there is a way to smuggle bytes through it. None of this has anything to do with UTF-8 as an encoding. (Note that the encoding='UTF-8' note in the response has to do with the presumed encoding of the file contents, not of the file name. As an empty file, it can be considered to be a stream of zero Unicode characters, encoded UTF-8, so that's valid.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07/03/2015 17:16, Marko Rauhamaa wrote: Mark Lawrence breamore...@yahoo.co.uk: It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko You've been a PITA ever since you first joined this list, what about it? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python itself violates that principle by having os.listdir() return str objects (to mention one example). Only because you gave it a str with the path name. If you want to refer to file names using bytes, then be consistent and refer to ALL file names using bytes. As I demonstrated, that works just fine. Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4.9.1] on linux Type help, copyright, credits or license for more information. import os type(os.listdir(os.curdir)[0]) class 'str' Help on module os: DESCRIPTION This exports: - os.curdir is a string representing the current directory ('.' or ':') - os.pardir is a string representing the parent directory ('..' or '::') Explicitly documented as strings. If you want to work with strings, work with strings. If you want to work with bytes, don't use os.curdir, use bytes instead. Personally, I'm happy using strings, but if you want to go down the path of using bytes, you simply have to be consistent, and that probably means being platform-dependent anyway, so just use b. for the current directory. I think we're all agreeing: not all file systems are the same, and Python doesn't smooth out all of the bumps, even for something that seems as simple as displaying the names of files in a directory. And that's *after* we've agreed that filesystems contain files in hierarchical directories. Dan -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa ma...@pacujo.net wrote: You can't operate on file names and text files using Python strings. Or at least, you will need to add (nontrivial) exception catching logic. You can't operate on a JPG file using a Unicode string, nor an array of integers. What of it? You can't operate on an array of integers using a dictionary, either. So? How is this a failing of UTF-8? If you really REALLY can't use the bytes() type to work with something that is, yaknow, bytes, then you could use an alternative encoding that has a value for every byte. It's still not Unicode text, so it doesn't much matter which encoding you use. But it's much better to use the bytes type to work with bytes. It is not text, so don't treat it as text. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Chris Angelico ros...@gmail.com: If you really REALLY can't use the bytes() type to work with something that is, yaknow, bytes, then you could use an alternative encoding that has a value for every byte. It's still not Unicode text, so it doesn't much matter which encoding you use. But it's much better to use the bytes type to work with bytes. It is not text, so don't treat it as text. See: $ mkdir /tmp/xyz $ touch /tmp/xyz/$'\x80' $ python3 Python 3.3.2 (default, Dec 4 2014, 12:49:00) [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux Type help, copyright, credits or license for more information. import os os.listdir('/tmp/xyz') ['\udc80'] open(os.listdir('/tmp/xyz')[0]) Traceback (most recent call last): File stdin, line 1, in module FileNotFoundError: [Errno 2] No such file or directory: '\udc80' File names encoded with Latin-X are quite commonplace even in UTF-8 locales. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python itself violates that principle by having os.listdir() return str objects (to mention one example). Only because you gave it a str with the path name. If you want to refer to file names using bytes, then be consistent and refer to ALL file names using bytes. As I demonstrated, that works just fine. Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4.9.1] on linux Type help, copyright, credits or license for more information. import os type(os.listdir(os.curdir)[0]) class 'str' -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07/03/2015 18:34, Dan Sommers wrote: On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python itself violates that principle by having os.listdir() return str objects (to mention one example). Only because you gave it a str with the path name. If you want to refer to file names using bytes, then be consistent and refer to ALL file names using bytes. As I demonstrated, that works just fine. Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4.9.1] on linux Type help, copyright, credits or license for more information. import os type(os.listdir(os.curdir)[0]) class 'str' Help on module os: DESCRIPTION This exports: - os.curdir is a string representing the current directory ('.' or ':') - os.pardir is a string representing the parent directory ('..' or '::') Explicitly documented as strings. If you want to work with strings, work with strings. If you want to work with bytes, don't use os.curdir, use bytes instead. Personally, I'm happy using strings, but if you want to go down the path of using bytes, you simply have to be consistent, and that probably means being platform-dependent anyway, so just use b. for the current directory. I think we're all agreeing: not all file systems are the same, and Python doesn't smooth out all of the bumps, even for something that seems as simple as displaying the names of files in a directory. And that's *after* we've agreed that filesystems contain files in hierarchical directories. Dan Isn't pathlib https://docs.python.org/3/library/pathlib.html#module-pathlib effectively a more recent attempt at smoothing or even removing (some of) the bumps? Has anybody here got experience of it as I've never used it? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
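A minimal sketch of the pathlib API, reusing the /tmp/xyz directory from earlier in the thread (names come back as str, with the same surrogateescape treatment as os.listdir):
>>> from pathlib import Path
>>> p = Path('/tmp/xyz')
>>> [child.name for child in p.iterdir()]
['\udc80']
>>> (p / 'some_file.txt').suffix   # purely lexical operations need no filesystem access
'.txt'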
Re: Newbie question about text encoding
Mark Lawrence breamore...@yahoo.co.uk: On 07/03/2015 16:25, Marko Rauhamaa wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. Python 3.3.2 (default, Dec 4 2014, 12:49:00) [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux Type help, copyright, credits or license for more information. b'\x80'.decode('utf-8') Traceback (most recent call last): File stdin, line 1, in module UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa ma...@pacujo.net wrote: All you've proven is that there are bit patterns which are not UTF-8 streams... And that causes problems. Demonstrate. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote: File names encoded with Latin-X are quite commonplace even in UTF-8 locales. That is not a problem with UTF-8, though. I don't understand how you're blaming UTF-8 for that. I'm saying it creates practical problems. There's a snake in the paradise. There are two things happening here: 1) The underlying file system is not UTF-8, and you can't depend on that, Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python itself violates that principle by having os.listdir() return str objects (to mention one example). 2) You forgot to put the path on that, so it failed to find the file. Here's my version of your demo: open(/tmp/xyz/+os.listdir('/tmp/xyz')[0]) _io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8' Looks fine to me. I stand corrected. Then we have: os.listdir()[0].encode('utf-8') Traceback (most recent call last): File stdin, line 1, in module UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python itself violates that principle by having os.listdir() return str objects (to mention one example). Only because you gave it a str with the path name. If you want to refer to file names using bytes, then be consistent and refer to ALL file names using bytes. As I demonstrated, that works just fine. Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4.9.1] on linux Type help, copyright, credits or license for more information. import os type(os.listdir(os.curdir)[0]) class 'str' Help on module os: DESCRIPTION This exports: - os.curdir is a string representing the current directory ('.' or ':') - os.pardir is a string representing the parent directory ('..' or '::') Explicitly documented as strings. If you want to work with strings, work with strings. If you want to work with bytes, don't use os.curdir, use bytes instead. Personally, I'm happy using strings, but if you want to go down the path of using bytes, you simply have to be consistent, and that probably means being platform-dependent anyway, so just use b. for the current directory. Normally, using Unicode strings for file names will work just fine. Any name that you craft yourself will be correctly encoded for the target file system (or UTF-8 if you can't know), and any that you get back from os.listdir or equivalent will be usable in file name contexts. What else can you do with a file name that isn't encoded the way you expect it to be? Unless you have some out-of-band encoding information, you can't do anything meaningful with the stream of bytes, other than keeping it exactly as it is. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
--- Original Message - From: Chris Angelico ros...@gmail.com To: Cc: python-list@python.org python-list@python.org Sent: Saturday, March 7, 2015 6:26 PM Subject: Re: Newbie question about text encoding On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote: See: $ mkdir /tmp/xyz $ touch /tmp/xyz/ \x80' $ python3 Python 3.3.2 (default, Dec 4 2014, 12:49:00) [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux Type help, copyright, credits or license for more information. import os os.listdir('/tmp/xyz') ['\udc80'] open(os.listdir('/tmp/xyz')[0]) Traceback (most recent call last): File stdin, line 1, in module FileNotFoundError: [Errno 2] No such file or directory: '\udc80' File names encoded with Latin-X are quite commonplace even in UTF-8 locales. That is not a problem with UTF-8, though. I don't understand how you're blaming UTF-8 for that. There are two things happening here: 1) The underlying file system is not UTF-8, and you can't depend on that, ergo the decode to Unicode has to have some special handling of failing bytes. 2) You forgot to put the path on that, so it failed to find the file. Here's my version of your demo: open(/tmp/xyz/+os.listdir('/tmp/xyz')[0]) _io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8' Looks fine to me. Alternatively, if you pass a byte string to os.listdir, you get back a list of byte string file names: os.listdir(b/tmp/xyz) [b'\x80'] Nice, I did not know that. And glob.glob works the same way: it returns a list of ustrings when given a ustring, and returns bstrings when given a bstring. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, 07 Mar 2015 19:00:47 +, Mark Lawrence wrote: Isn't pathlib https://docs.python.org/3/library/pathlib.html#module-pathlib effectively a more recent attempt at smoothing or even removing (some of) the bumps? Has anybody here got experience of it as I've never used it? I almost said something about Common Lisp's PATHNAME type, but I didn't. An extremely quick reading of that page tells me that os.pathlib addresses *some* of the issues that PATHNAME addresses, but os.pathlib seems more limited in scope (e.g., os.pathlib doesn't account for filesystems with versioned files). I'll certainly have a closer look later. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Please provide an example; that sounds like a bug. If there is any invalid UTF-8 stream which decodes without an error, it is actually a security bug, and should be fixed pronto in all affected and supported versions. Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Stop using MySQL, which is a joke of a database[1], and use Postgres which does not have this problem. I agree with the recommendation, though to be fair to MySQL, it is now possible to store full Unicode. Though personally, I think the whole UTF8MB3 vs UTF8MB4 split is an embarrassment and should be abolished *immediately* - not we may change the meaning of UTF8 to be an alias for UTF8MB4 in the future, just completely abolish the distinction right now. (And deprecate the longer words.) There should be no reason to build any kind of UTF-8 but limited to three bytes encoding for anything. Ever. But at least you can, if you configure things correctly, store any Unicode character in your TEXT field. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
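(The arithmetic behind the UTF8MB3/UTF8MB4 split, checked in Python: MySQL's legacy utf8 stores at most three bytes per character, which only covers the BMP.)

# Every BMP code point fits in at most three UTF-8 bytes...
assert len('\uffff'.encode('utf-8')) == 3
# ...but anything in the supplementary planes needs four, which is why a
# three-byte-per-character "utf8" column cannot hold an emoji or an SMP ideograph.
assert len('\U0001f4a9'.encode('utf-8')) == 4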
Re: Newbie question about text encoding
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Rustom Mody wrote: My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use characters and some only (the equivalent of) what we call words. I see you are blaming everyone except the people actually to blame. I don't think you need to blame anybody. I think the UCS-2 mistake was both deplorable and very understandable. At the time it looked like the magic bullet to get out of the 8-bit mess. While 16-bit wide wchar_t's looked like a hugely expensive price, it was deemed forward-looking to pay it anyway to resolve the character set problem once and for all. Linux was lucky to join the fray late enough to benefit from the bad UCS-2 experience. That said, UTF-8 does suffer badly from its not being a bijective mapping. (Linux didn't quite dodge the bullet with pthreads, threads being another sad fad of the 1990's. The hippies that cooked up the fork system call should be awarded the next Millennium Prize. That foresight or stroke of luck has withstood the challenge of half a century.) But there's nothing wrong with the design of the SMP. It allows the great majority of text, probably 99% or more, to use two bytes (UTF-16) or no more than three bytes (UTF-8), while only relatively specialised uses need four bytes for some code points. The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up. It may yet be that UTF-32 replaces all previous schemes since it has all the benefits of ASCII and only one drawback: redundancy. Maybe one day we'll declare the byte 32 bits wide and be done with it. In some many other aspects, 32-bit bytes are the de-facto reality already. Even C coders routinely use 32 bits to express boolean values. And when Roy's customers demand that his product support emoji, or complain that they cannot spell their own name because of his parochial and ignorant idea of crap, perhaps he will consider doing what he should have done from the beginning: That's a recurring theme: Why didn't we do IPv6 from the get-go? Why didn't we do multi-user from the get-go? Why didn't we do localization from the get-go? There comes a point when you have to release to start making money. You then suffer the consequences until your company goes bankrupt. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07/03/2015 12:02, Chris Angelico wrote: On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa ma...@pacujo.net wrote: The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up. So... use Pike, or Python 3.3+? ChrisA Cue obligatory cobblers from our RUE. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 07/03/2015 11:09, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? Methinks somebody has been drinking too much loony juice. Either that or taking too much notice of our RUE. Not that I've done a proper analysis, but to my knowledge there's nothing like the number of issues on the bug tracker for Unicode bugs for Python 3 compared to Python 2. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? As far as I am aware, every code point has one and only one valid UTF-8 encoding, and every UTF-8 encoding has one and only one valid code point. There are *invalid* UTF-8 encodings, such as CESU-8, which is sometimes mislabelled as UTF-8 (Oracle, I'm looking at you.) It violates the rule that valid UTF-8 encodings are the shortest possible. E.g. SMP code points should be encoded to four bytes using UTF-8: py> u'\U0010FF01'.encode('utf-8')  # U+10FF01 '\xf4\x8f\xbc\x81' But in CESU-8, the code point is first interpreted as a UTF-16 surrogate pair: py> u'\U0010FF01'.encode('utf-16be') '\xdb\xff\xdf\x01' then each surrogate is treated as a 16-bit code unit and individually encoded to three bytes using UTF-8: py> u'\udbff'.encode('utf-8') '\xed\xaf\xbf' py> u'\udf01'.encode('utf-8') '\xed\xbc\x81' giving six bytes in total: '\xed\xaf\xbf\xed\xbc\x81' This is not UTF-8! But some software mislabels it as UTF-8. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
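(The CESU-8 construction above can be reproduced in Python 3 with the surrogatepass error handler; a sketch of the mechanism only, not a recommendation to ever emit CESU-8.)

cp = 0x10FF01
# Split the supplementary code point into a UTF-16 surrogate pair...
hi = 0xD800 + ((cp - 0x10000) >> 10)       # 0xDBFF
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)     # 0xDF01
# ...then UTF-8-encode each surrogate on its own, which needs 'surrogatepass'.
cesu8 = (chr(hi).encode('utf-8', 'surrogatepass')
         + chr(lo).encode('utf-8', 'surrogatepass'))
assert cesu8 == b'\xed\xaf\xbf\xed\xbc\x81'            # six bytes of CESU-8
assert chr(cp).encode('utf-8') == b'\xf4\x8f\xbc\x81'  # the real four-byte UTF-8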
Re: Newbie question about text encoding
Rustom Mody wrote: On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote: [...] Chris is suggesting that going from BMP to all of Unicode is not the hard part. Going from ASCII to the BMP part of Unicode is the hard part. If you can do that, you can go the rest of the way easily. Depends where the going is starting from. I specifically names Java, Javascript, Windows... among others. Here's some quotes from the supplementary chars doc of Java http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html | Supplementary characters are characters in the Unicode standard whose | code points are above U+, and which therefore cannot be described as | single 16-bit entities such as the char data type in the Java | programming language. Such characters are generally rare, but some are | used, for example, as part of Chinese and Japanese personal names, and | so support for them is commonly required for government applications in | East Asian countries... | The introduction of supplementary characters unfortunately makes the | character model quite a bit more complicated. | Unicode was originally designed as a fixed-width 16-bit character | encoding. The primitive data type char in the Java programming language | was intended to take advantage of this design by providing a simple data | type that could hold | any character Version 5.0 of the J2SE is required to support | version 4.0 of the Unicode standard, so it has to support supplementary | characters. My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use characters and some only (the equivalent of) what we call words. I see you are blaming everyone except the people actually to blame. It is 2015. Unicode 2.0 introduced the SMPs in 1996, almost twenty years ago, the same year as 1.0 release of Java. Java has had eight major new releases since then. Oracle, and Sun before them, are/were serious, tier-1, world-class major IT companies. Why haven't they done something about introducing proper support for Unicode in Java? It's not hard -- if Python can do it using nothing but volunteers, Oracle can do it. They could even do it in a backwards-compatible way, by leaving the existing APIs in place and adding new APIs. As for Microsoft, as a member of the Unicode Consortium they have no excuse. But I think you exaggerate the lack of support for SMPs in Windows. Some parts of Windows have no SMP support, but they tend to be the oldest and less important (to Microsoft) parts, like the command prompt. Anyone have Powershell and like to see how well it supports SMP? This Stackoverflow question suggests that post-Windows 2000, the Windows file system has proper support for code points in the supplementary planes: http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua Or maybe not. Or you can skip the blame-game and simply note the fact that large segments of extant code-bases are currently in bug-prone or plain buggy state. This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. 
You would have to program in artificial restrictions that otherwise don't exist. Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0 irrelevant. Glad you agree about that much at least. [...] Conclusion: faulty implementations of UTF-16 which incorrectly handle surrogate pairs should be replaced by non-faulty implementations, or changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be upgraded. Imagine for a moment a thought experiment -- we are not on a python but a java forum and please rewrite the above para. There is no need to re-write it. If Java's only implementation of Unicode assumes that code points are 16 bits only, then Java needs a new Unicode implementation. (I assume that the existing one cannot be changed for backwards-compatibility reasons.) Are you addressing the vanilla java programmer? Language implementer? Designer? The Java-funders -- earlier Sun, now Oracle? The last three should be considered the same people. The vanilla Java programmer is not responsible for the short-comings of Java's implementation. [...] In practice, standards change. However if a standard changes so frequently that that users have to play catching cook and keep asking: Which version? they are justified in asking Are the standard-makers doing due diligence? Since Unicode has stability
Re: Newbie question about text encoding
On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa ma...@pacujo.net wrote: The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up. So... use Pike, or Python 3.3+? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Saturday, March 7, 2015 at 11:41:53 AM UTC+5:30, Terry Reedy wrote: On 3/6/2015 11:20 AM, Rustom Mody wrote: = pp = print (pp) = Try open it in idle3 and you get (at least I get): $ idle3 ff.py Traceback (most recent call last): File /usr/bin/idle3, line 5, in module main() File /usr/lib/python3.4/idlelib/PyShell.py, line 1562, in main if flist.open(filename) is None: File /usr/lib/python3.4/idlelib/FileList.py, line 36, in open edit = self.EditorWindow(self, filename, key) File /usr/lib/python3.4/idlelib/PyShell.py, line 126, in __init__ EditorWindow.__init__(self, *args) File /usr/lib/python3.4/idlelib/EditorWindow.py, line 294, in __init__ if io.loadfile(filename): File /usr/lib/python3.4/idlelib/IOBinding.py, line 236, in loadfile self.text.insert(1.0, chars) File /usr/lib/python3.4/idlelib/Percolator.py, line 25, in insert self.top.insert(index, chars, tags) File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 81, in insert self.addcmd(InsertCommand(index, chars, tags)) File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 116, in addcmd cmd.do(self.delegate) File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 219, in do text.insert(self.index1, self.chars, self.tags) File /usr/lib/python3.4/idlelib/ColorDelegator.py, line 82, in insert self.delegate.insert(index, chars, tags) File /usr/lib/python3.4/idlelib/WidgetRedirector.py, line 148, in __call__ return self.tk_call(self.orig_and_operation + args) _tkinter.TclError: character U+1f4a9 is above the range (U+-U+) allowed by Tcl So who/what is broken? tcl The possible workaround is for Idle to translate to \U0001f4a9 (10 chars) before sending it to tk. But some perspective. In the console interpreter: print(\U0001f4a9) Traceback (most recent call last): File stdin, line 1, in module File C:\Programs\Python34\lib\encodings\cp437.py, line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in posit ion 0: character maps to undefined So what is broken? Windows Command Prompt. More perspective. tk/Idle *will* print *something* for every BMP char. Command Prompt will not. It does not even do ucs-2 correctly. So which is more broken? Windows Command Prompt. Who has perhaps 1,000,000 times more resources, Microsoft? or the tcl/tk group? I think we all know. Thanks Terry for the perspective. From my side: No complaints about python or tcl (or idle -- its actually neater than emacs if only emacs was not burnt into my nervous system) Even unicode -- only marginal complaints. I wrote http://blog.languager.org/2015/02/universal-unicode.html precisely to say that unicode is a wonderful thing and one should be enthusiastic about it. [You got that better than anyone else who has spoken -- Thanks] Xah's pages are way more comprehensive than mine. But comprehensive can be a negative -- ultimately the unicode standard is the most comprehensive and correspondingly impenetrable without a compass. The only very minor complaint I would make is: If idle is unable to deal with SMP-chars and this is known and unlikely to change (until TK changes), why not put up a dialog of the sort: SMP char on line nn SMP support currently unimplemented -- Sorry instead of a backtrace? [As I said just a suggestion] -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Saturday, March 7, 2015 at 11:49:44 PM UTC+5:30, Mark Lawrence wrote: On 07/03/2015 17:16, Marko Rauhamaa wrote: Mark Lawrence: It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko You've been a PITA ever since you first joined this list, what about it? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Hi Mark Your UK accent above is funny [At least *I* find it so] The above however is crossing a line. Please desist. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Are you talking about the fact that not all byte streams are valid UTF-8? That is, some byte objects b may raise an exception on b.decode('utf-8'). I don't see why that means UTF-8 suffers badly from this. Can you give an example of where you would expect to take an arbitrary byte-stream, decode it as UTF-8, and expect the results to be meaningful? For those cases where you do wish to take an arbitrary byte stream and round-trip it, Python now provides an error handler for that. py> import random py> b = bytes([random.randint(0, 255) for _ in range(10000)]) py> s = b.decode('utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0: invalid start byte py> s = b.decode('utf-8', errors='surrogateescape') py> s.encode('utf-8', errors='surrogateescape') == b True -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Mar 7, 2015 at 1:03 AM, random...@fastmail.us wrote: On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote: Number of code points is the most logical way to length-limit something. If you want to allow users to set their display names but not to make arbitrarily long ones, limiting them to X code points is the safest way (and preferably do an NFC or NFD normalization before counting, for consistency); Why are you length-limiting it? Storage space? Limit it in whatever encoding they're stored in. Why are combining marks pathological but surrogate characters not? Display space? Limit it by columns. If you're going to allow a Japanese user's name to be twice as wide, you've got a problem when you go to display it. To prevent people from putting three paragraphs of lipsum in and calling it a username. this means you disallow pathological cases where every base character has innumerable combining marks added. No it doesn't. If you limit it to, say, fifty, someone can still post two base characters with twenty combining marks each. If you actually want to disallow this, you've got to do more work. You've disallowed some of the pathological cases, some of the time, by coincidence. And limiting the number of UTF-8 bytes, or the number of UTF-16 code points, will accomplish this just as well. They can, but then they're limited to two base characters. They can't have fifty base characters with twenty combining marks each. That's the point. Now, if you intend to _silently truncate_ it to the desired length, you certainly don't want to leave half a character in, of course. But who's to say the base character plus first few combining marks aren't also half a character? If you're _splitting_ a string, rather than merely truncating it, you probably don't want those combining marks at the beginning of part two. So you truncate to the desired length, then if the first character of the trimmed-off section is a combining mark (based on its Unicode character types), you keep trimming until you've removed a character which isn't. Then, if you no longer have any content whatsoever, reject the name. Simple. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
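(Chris's trimming rule, sketched with the stdlib unicodedata module; the limit and the sample string are made up, and a real implementation would also need a policy for other joining characters such as ZWJ sequences.)

import unicodedata

def truncate_name(s, limit):
    s = unicodedata.normalize('NFC', s)
    cut, rest = s[:limit], s[limit:]
    # If the first trimmed-off code point is a combining mark, the cut landed
    # inside a base+marks cluster, so trim back to remove the whole cluster.
    if rest and unicodedata.category(rest[0]).startswith('M'):
        while cut and unicodedata.category(cut[-1]).startswith('M'):
            cut = cut[:-1]
        cut = cut[:-1]     # drop the stranded base character as well
    return cut             # caller rejects the name if nothing is left

# 'x' plus a combining arrow, 'y' plus a combining arrow: cutting at 3 keeps
# only the first complete cluster instead of leaving 'y' with its mark chopped off.
assert truncate_name('x\u20d7y\u20d7', 3) == 'x\u20d7'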
Re: Newbie question about text encoding
On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote: To prevent people from putting three paragraphs of lipsum in and calling it a username. Limiting by UTF-8 bytes or UTF-16 units works just as well for that. So you truncate to the desired length, then if the first character of the trimmed-off section is a combining mark (based on its Unicode character types), you keep trimming until you've removed a character which isn't. Then, if you no longer have any content whatsoever, reject the name. Simple. My entire point was that UTF-32 doesn't save you from that, so it cannot be called a deficiency of UTF-16. My point is there are very few problems to which count of Unicode code points is the only right answer - that UTF-32 is good enough for but that are meaningfully impacted by a naive usage of UTF-16, to the point where UTF-16 is something you have to be safe from. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Rustom Mody wrote: On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] Strawman. Sigh. If I had a dollar for every time somebody cried Strawman! when what they really should say is Yes, that's a good argument, I'm afraid I can't argue against it, at least not without considerable thought, I'd be a wealthy man... If I had a dollar for every time anyone said If I had insert currency unit here for every time..., I'd go meta all day long and profit from it... :) - If you are writing your own file system layer, it's 2015 fer fecks sake, file names should be Unicode strings, not bytes! (That's one part of the Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file system, whichever you please, but again remember that both are variable-width formats. I agree that that part of the Unix model needs to change, but there are two viable ways to move forward: 1) Keep file names as bytes, but mandate that they be valid UTF-8 streams, and recommend that they be decoded UTF-8 for display to a human 2) Change the entire protocol stack from the file system upwards so that file names become Unicode strings. Trouble with #2 is that file names need to be passed around somehow, which means bytes in memory. So ultimately, #2 really means keep file names as bytes, and mandate an encoding all the way up the stack... so it's a massive documentation change that really comes down to the same thing as #1. This is one area where, as I understand it, Mac OS got it right. It's time for other Unix variants to adopt the same policy. The bulk of file names will be ASCII-only anyway, so requiring UTF-8 won't affect them; a lot of others are already UTF-8; so all we need is a transition scheme for the remaining ones. If there's a known FS encoding, it ought to be possible to have a file system conversion tool that goes through everything, decodes, re-encodes UTF-8, and then flags the file system as UTF-8 compliant. All that'd be left would be the file names that are broken already - ones that don't decode in the FS encoding - and there's nothing to be done with them but wrap them up into something probably-meaningless-but reversible. When can we start doing this? ext5? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
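(A rough sketch of the known-FS-encoding conversion Chris describes, assuming the legacy names are Latin-1; purely illustrative, not a real migration tool, and it ignores collisions between converted names.)

import os

def to_utf8_name(raw, legacy_encoding='latin-1'):
    try:
        raw.decode('utf-8')
        return raw                       # already valid UTF-8: leave it alone
    except UnicodeDecodeError:
        # Reinterpret under the assumed legacy encoding and re-encode as UTF-8.
        return raw.decode(legacy_encoding).encode('utf-8')

# Dry run over one directory, printing the renames that would be needed:
for name in os.listdir(b'.'):
    new = to_utf8_name(name)
    if new != name:
        print(name, b'->', new)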
Re: Newbie question about text encoding
On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote: Number of code points is the most logical way to length-limit something. If you want to allow users to set their display names but not to make arbitrarily long ones, limiting them to X code points is the safest way (and preferably do an NFC or NFD normalization before counting, for consistency); Why are you length-limiting it? Storage space? Limit it in whatever encoding they're stored in. Why are combining marks pathological but surrogate characters not? Display space? Limit it by columns. If you're going to allow a Japanese user's name to be twice as wide, you've got a problem when you go to display it. this means you disallow pathological cases where every base character has innumerable combining marks added. No it doesn't. If you limit it to, say, fifty, someone can still post two base characters with twenty combining marks each. If you actually want to disallow this, you've got to do more work. You've disallowed some of the pathological cases, some of the time, by coincidence. And limiting the number of UTF-8 bytes, or the number of UTF-16 code points, will accomplish this just as well. Now, if you intend to _silently truncate_ it to the desired length, you certainly don't want to leave half a character in, of course. But who's to say the base character plus first few combining marks aren't also half a character? If you're _splitting_ a string, rather than merely truncating it, you probably don't want those combining marks at the beginning of part two. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Rustom Mody wrote: On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] Strawman. Sigh. If I had a dollar for every time somebody cried Strawman! when what they really should say is Yes, that's a good argument, I'm afraid I can't argue against it, at least not without considerable thought, I'd be a wealthy man... Lets please stick to UTF-16 shall we? Now tell me: - Is it broken or not? The UTF-16 standard is not broken. It is a perfectly adequate variable-width encoding, and considerably better than most other variable-width encodings. However, many implementations of UTF-16 are faulty, and assume a fixed-width. *That* is broken, not UTF-16. (The difference between specification and implementation is critical.) - Is it widely used or not? It's quite widely used. - Should programmers be careful of it or not? Programmers should be aware whether or not any specific language uses UTF-16 and whether the implementation is buggy. That will help them decide whether or not to use that language. - Should programmers be warned about it or not? I'm in favour of people having more knowledge rather than less. I don't believe that ignorance is bliss, except perhaps in the case that a giant asteroid the size of Texas is heading straight for us. Programmers should be aware of the limitations or bugs in any UTF-16 implementation they are likely to run into. Hence my general recommendation: - For transmission over networks or storage on permanent media (e.g. the content of text files), use UTF-8. It is well-implemented by nearly all languages that support Unicode, as far as I know. - If you are designing your own language, your implementation of Unicode strings should use something like Python's FSR, or UTF-8 with tweaks to make string indexing O(1) rather than O(N), or correctly-implemented UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte per code point format, you fail. - If you are using an existing language, be aware of any bugs and limitations in its Unicode implementation. You may or may not be able to work around them, but at least you can decide whether or not you wish to try. - If you are writing your own file system layer, it's 2015 fer fecks sake, file names should be Unicode strings, not bytes! (That's one part of the Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file system, whichever you please, but again remember that both are variable-width formats. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody rustompm...@gmail.com wrote: Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode consortium for allocating something at codepoint 0, or the code that can't cope with a perfectly normal character? Strawman. Not really, no. I know of lots of programs that can't handle embedded NULs, and which fail in various ways when given them (the most common is simple truncation, but it's by far not the only way). And it's exactly the same: a program that purports to handle arbitrary Unicode text should be able to handle arbitrary Unicode text, not Unicode text as long as it contains only codepoints within the range X-Y. It doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or U+1F4A3 - if your code blows up, it's a failure in your code. Lets please stick to UTF-16 shall we? Now tell me: - Is it broken or not? - Is it widely used or not? - Should programmers be careful of it or not? - Should programmers be warned about it or not? No, UTF-16 is not itself broken. (It would be if we expected codepoints above 0x10FFFF, and it's because of UTF-16 that that's the cap on Unicode, but it's looking unlikely that we'll be needing any more than that anyway.) What's broken is code that tries to treat UTF-16 as if it's UCS-2, and then breaks on surrogate pairs. Yes, it's widely used. Programmers should probably be warned about it, but only because its tradeoffs are generally poorer than UTF-8's. If you use it correctly, there's no problem. Also: Can a programmer who is away from UTF-16 in one part of the system (say by using python3) assume he is safe all over? I don't know what you mean here. Do you mean that your Python 3 program is at risk in some way because there might be some other program that misuses UTF-16? Well, sure. And there might be some other program that misuses buffer sizes, SQL queries, or shell invocations, and makes your overall system vulnerable to buffer overruns or injection attacks. These are significantly more likely AND more serious than UTF-16 misuses. And you still have not proven anything about SMP characters being a problem, but only that code can be broken. Broken code is still broken code, no matter what your actual brokenness. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote: Also: Can a programmer who is away from UTF-16 in one part of the system (say by using python3) assume he is safe all over? The most common failure of UTF-16 support, supposedly, is in programs misusing the number of code units (for length or random access) as a proxy for the number of characters. However, when do you _really_ want the number of characters? You may want to use it for, for example, the number of columns in a 'monospace' font, which you've already screwed up because you haven't accounted for double-wide characters or combining marks. Or you may want the position that pressing an arrow key or backspace or forward-delete a number of times will reach, which has its own rules in e.g. Indic languages (and also fails on Latin with combining marks). -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Mar 7, 2015 at 12:33 AM, random...@fastmail.us wrote: However, when do you _really_ want the number of characters? You may want to use it for, for example, the number of columns in a 'monospace' font, which you've already screwed up because you haven't accounted for double-wide characters or combining marks. Or you may want the position that pressing an arrow key or backspace or forward-delete a number of times will reach, which has its own rules in e.g. Indic languages (and also fails on Latin with combining marks). Number of code points is the most logical way to length-limit something. If you want to allow users to set their display names but not to make arbitrarily long ones, limiting them to X code points is the safest way (and preferably do an NFC or NFD normalization before counting, for consistency); this means you disallow pathological cases where every base character has innumerable combining marks added. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
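(That policy in a few lines, assuming Python 3 and the stdlib unicodedata module; the 50-code-point limit is an arbitrary example.)

import unicodedata

def acceptable_display_name(name, max_code_points=50):
    name = unicodedata.normalize('NFC', name)   # count composed characters once
    return 0 < len(name) <= max_code_points

assert acceptable_display_name('René')
assert not acceptable_display_name('x' * 51)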
Re: Newbie question about text encoding
random...@fastmail.us wrote: My point is there are very few problems to which count of Unicode code points is the only right answer - that UTF-32 is good enough for but that are meaningfully impacted by a naive usage of UTF-16, to the point where UTF-16 is something you have to be safe from. I'm not sure why you care about the count of Unicode code points, although that *is* a problem. Not for end-user reasons like "how long is my password?", but because it makes your job as a programmer harder. [steve@ando ~]$ python2.7 -c "print (len(u'\U0000FFFF:\U00014445'))" 4 [steve@ando ~]$ python3.3 -c "print (len(u'\U0000FFFF:\U00014445'))" 3 It's hard to reason about your code when something as fundamental as the length of a string is implementation-dependent. (By the way, the right answer should be 3, not 4.) But an even more important problem is that broken-UTF-16 lets you create invalid, impossible Unicode strings *by accident*. Naturally you can create broken Unicode if you assemble strings of surrogates yourself, but broken-UTF-16 means it can happen from otherwise innocuous operations like reversing a string: py> s = u'\U0000FFFF:\U00014445'  # Python 2.7 narrow build py> s[::-1] u'\udc45\ud811:\uffff' It's hard for me to demonstrate that the reversed string is broken because the shell I am using does an amazingly good job of handling broken Unicode. Even if I print it, the shell just prints missing-character glyphs instead of crashing (fortunately for me!). But the first two code points are in illegal order: \udc45 is a low surrogate, and must follow a high surrogate; \ud811 is a high surrogate, and must precede a low surrogate; I'm not convinced you should be allowed to create Unicode strings containing mismatched surrogates like this deliberately, but you certainly shouldn't be able to do so by accident. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
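(A small helper to at least detect such accidentally broken strings before they reach an encoder, assuming a Python 3.3+ wide build where str stores code points rather than UTF-16 units; the function name is made up.)

def has_stray_surrogate(s):
    # In a Python 3.3+ str, any code point in U+D800..U+DFFF is by definition
    # unpaired, since the string holds code points, not UTF-16 code units.
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

assert has_stray_surrogate('\udc45\ud811')     # the reversed example above
assert not has_stray_surrogate('\U00014445')   # a well-formed SMP character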
Re: Newbie question about text encoding
On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] Strawman. Sigh. If I had a dollar for every time somebody cried Strawman! when what they really should say is Yes, that's a good argument, I'm afraid I can't argue against it, at least not without considerable thought, I'd be a wealthy man... Missed my addition? Here it is again – grammar slightly corrected. === Ah well if you insist on pursuing the nul-char example... - No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0 - No, the code that can't cope with a perfectly normal character is not wrong - It is C that is wrong for designing a buggy string data structure that cannot contain a valid char. === In fact Chris' nul-char example is so strongly supporting my argument – bugginess of UTF-16 – it is perhaps too strong even for me. To elaborate: Take the buggy-plane analogy I gave in http://blog.languager.org/2015/03/whimsical-unicode.html If a plane model crashes once in 10,000 flights compared to others that crash once in one million flights we can call it bug-prone though not strictly buggy – it does fly times safely! OTOH if a plane is guaranteed to crash we can all it a buggy plane. C's string is not bug-prone its plain buggy as it cannot represent strings with nulls. I would not go that far for UTF-16. It is bug-inviting but it can also be implemented correctly Lets please stick to UTF-16 shall we? Now tell me: - Is it broken or not? The UTF-16 standard is not broken. It is a perfectly adequate variable-width encoding, and considerably better than most other variable-width encodings. However, many implementations of UTF-16 are faulty, and assume a fixed-width. *That* is broken, not UTF-16. (The difference between specification and implementation is critical.) - Is it widely used or not? It's quite widely used. - Should programmers be careful of it or not? Programmers should be aware whether or not any specific language uses UTF-16 and whether the implementation is buggy. That will help them decide whether or not to use that language. - Should programmers be warned about it or not? I'm in favour of people having more knowledge rather than less. I don't believe that ignorance is bliss, except perhaps in the case that a giant asteroid the size of Texas is heading straight for us. Programmers should be aware of the limitations or bugs in any UTF-16 implementation they are likely to run into. Hence my general recommendation: - For transmission over networks or storage on permanent media (e.g. the content of text files), use UTF-8. It is well-implemented by nearly all languages that support Unicode, as far as I know. - If you are designing your own language, your implementation of Unicode strings should use something like Python's FSR, or UTF-8 with tweaks to make string indexing O(1) rather than O(N), or correctly-implemented UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) FSR is possible in python for very specific pythonic reasons - dynamicness - immutable strings Drop either and FSR is impossible If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte per code point format, you fail. Seems obvious enough. So lets see... Here's a 2-line python program -- runs well enough when run as a command. 
Program: = pp = print (pp) = Try open it in idle3 and you get (at least I get): $ idle3 ff.py Traceback (most recent call last): File /usr/bin/idle3, line 5, in module main() File /usr/lib/python3.4/idlelib/PyShell.py, line 1562, in main if flist.open(filename) is None: File /usr/lib/python3.4/idlelib/FileList.py, line 36, in open edit = self.EditorWindow(self, filename, key) File /usr/lib/python3.4/idlelib/PyShell.py, line 126, in __init__ EditorWindow.__init__(self, *args) File /usr/lib/python3.4/idlelib/EditorWindow.py, line 294, in __init__ if io.loadfile(filename): File /usr/lib/python3.4/idlelib/IOBinding.py, line 236, in loadfile self.text.insert(1.0, chars) File /usr/lib/python3.4/idlelib/Percolator.py, line 25, in insert self.top.insert(index, chars, tags) File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 81, in insert self.addcmd(InsertCommand(index, chars, tags)) File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 116, in addcmd cmd.do(self.delegate) File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 219, in do text.insert(self.index1, self.chars, self.tags) File /usr/lib/python3.4/idlelib/ColorDelegator.py, line 82, in insert self.delegate.insert(index, chars, tags) File /usr/lib/python3.4/idlelib/WidgetRedirector.py, line 148, in __call__ return
Re: Newbie question about text encoding
On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody rustompm...@gmail.com wrote: C's string is not bug-prone its plain buggy as it cannot represent strings with nulls. I would not go that far for UTF-16. It is bug-inviting but it can also be implemented correctly C's standard library string handling functions are restricted in that they handle a 255-byte alphabet. They do not handle Unicode, they do not handle NUL, that is simply how they are. But I never said I was talking about the C standard library. If you type a text string into a GUI entry field, or encode it quoted-printable and pass it to a web server, or whatever, you shouldn't know or care about what language the program is written in; and if that program barfs on a NUL, that's a limitation. That limitation might be caused by its naive use of strcpy() when it should have used memcpy(), but that's not your problem. It's exactly the same here: if your program chokes on an SMP character, I don't care what your program was written in or what library functions your program called on. All I care is that your program - repeated for emphasis, *your* program - failed on that input. It's up to you to choose your underlying functions appropriately. - If you are designing your own language, your implementation of Unicode strings should use something like Python's FSR, or UTF-8 with tweaks to make string indexing O(1) rather than O(N), or correctly-implemented UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) FSR is possible in python for very specific pythonic reasons - dynamicness - immutable strings Drop either and FSR is impossible I don't know what you mean by dynamicness. What you do need is a Unicode string type, such that the application program isn't aware of the underlying bytes, but simply treats this string as a sequence of code points. The immutability isn't technically a requirement, but it does make the FSR much more manageable; in a language with mutable strings, it's probably more efficient to use UTF-32 for simplicity, but it's up to the language designer to figure that out. (It might be best to use something like the FSR, but where strings are never narrowed after being widened, so it'd be possible for an ASCII-only string to be stored UTF-32. That has consequences for comparisons, but might give a reasonable hybrid of storage and mutation performance.) _tkinter.TclError: character U+1f4a9 is above the range (U+-U+) allowed by Tcl So who/what is broken? The exception is pretty clear on that point. Tcl can't handle SMP characters. So it's Tcl that's broken. Unless there's evidence to the contrary, that's what I would expect to be the case. Correct. Windows is broken for using UTF-16 Linux is broken for conflating UTF-8 and byte string. Lot of breakage out here dont you think? May be related to the equation UTF-16 = UCS-2 + Duct-tape UTF-16 is an encoding that was designed to be backward-compatible with UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it what you will, but backward compatibility is pretty important. Look at things like DES3 - if you use the same key three times, it's compatible with DES. Linux isn't broken for conflating UTF-8 and byte strings. Linux is flawed in that it defines file names to be byte strings, which means that every file system could be different in what it actually uses as the encoding. Since file names exist for the benefit of humans, they should be treated as text, so we should work with them as text. 
But for reasons of backward compatibility, Linux hasn't yet changed. Windows isn't broken for using UTF-16. I think it's a poor trade-off, given that so many file names are ASCII-only; and, of course, if any program treats a Windows file name as UCS-2, then that program is broken. But UTF-16 is not itself broken, any more than UTF-7 is. And UTF-7 is a lot harder to work with. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
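(The FSR mentioned above can be observed directly from Python 3.3+ with sys.getsizeof; the exact byte counts vary by build, so treat the output as indicative only.)

import sys

ascii_text  = 'a' * 100            # stored with 1 byte per code point
bmp_text    = '\u0101' * 100       # widened to 2 bytes per code point
astral_text = '\U00010001' * 100   # widened to 4 bytes per code point

for text in (ascii_text, bmp_text, astral_text):
    print(len(text), sys.getsizeof(text))   # same length, growing storage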
Re: Newbie question about text encoding
On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote: Lets please stick to UTF-16 shall we? Now tell me: - Is it broken or not? - Is it widely used or not? - Should programmers be careful of it or not? - Should programmers be warned about it or not? Also: Can a programmer who is away from UTF-16 in one part of the system (say by using python3) assume he is safe all over? -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote: My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use characters and some only (the equivalent of) what we call words. Or you can skip the blame-game and simply note the fact that large segments of extant code-bases are currently in bug-prone or plain buggy state. For most of the 1990s, I was writing code in REXX, on OS/2. An even earlier adopter, REXX didn't have Unicode support _at all_, but instead had facilities for working with DBCS strings. You can't get everything right AND be the first to produce anything. Python didn't make Unicode strings the default until 3.0, but that's not Unicode's fault. This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. Here is Roy's Smith post that first started me thinking that something may be wrong with SMP https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with SMP characters - System made with python and mysql. SMP works with python, breaks mysql. So whole system broke due to those 4 in 200,000,000 records I know enough (or not enough) of unicode to be chary of statistical conclusions from the above. My conclusion is essentially an 'existence-proof': Hang on hang on. Why are you blaming Python or SMP characters for this? The problem here is MySQL, which doesn't adequately cope with the full Unicode range. (Or, didn't then, or doesn't with its default settings. I believe you can configure current versions of MySQL to work correctly, though I haven't actually checked. PostgreSQL gets it right, that's good enough for me.) SMP-chars can break systems. The breakage is costly-fied by the combination - layman statistical assumptions - BMP → SMP exercises different code-paths Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode consortium for allocating something at codepoint 0, or the code that can't cope with a perfectly normal character? Strawman. Lets please stick to UTF-16 shall we? Now tell me: - Is it broken or not? - Is it widely used or not? - Should programmers be careful of it or not? - Should programmers be warned about it or not? -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote: On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote: Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode consortium for allocating something at codepoint 0, or the code that can't cope with a perfectly normal character? Strawman. Not really, no. I know of lots of programs that can't handle embedded NULs, and which fail in various ways when given them (the most common is simple truncation, but it's by far not the only way). Ah well if you insist on pursuing the nul-char example... No the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0 Nor the code that can't cope with a perfectly normal character? But with C for having a data structure called string with a 'hole' in it. And it's exactly the same: a program that purports to handle arbitrary Unicode text should be able to handle arbitrary Unicode text, not Unicode text as long as it contains only codepoints within the range X-Y. It doesn't matter whether the code chokes on U+, U+005C, U+FFFC, or U+1F4A3 - if your code blows up, it's a failure in your code. Lets please stick to UTF-16 shall we? Now tell me: - Is it broken or not? - Is it widely used or not? - Should programmers be careful of it or not? - Should programmers be warned about it or not? No, UTF-16 is not itself broken. (It would be if we expected codepoints 0x10, and it's because of UTF-16 that that's the cap on Unicode, but it's looking unlikely that we'll be needing any more than that anyway.) What's broken is code that tries to treat UTF-16 as if it's UCS-2, and then breaks on surrogate pairs. Yes, it's widely used. Programmers should probably be warned about it, but only because its tradeoffs are generally poorer than UTF-8's. If you use it correctly, there's no problem. Also: Can a programmer who is away from UTF-16 in one part of the system (say by using python3) assume he is safe all over? I don't know what you mean here. Do you mean that your Python 3 program is at risk in some way because there might be some other program that misuses UTF-16? Yes some other program/library/API etc connected to the python one Well, sure. And there might be some other program that misuses buffer sizes, SQL queries, or shell invocations, and makes your overall system vulnerable to buffer overruns or injection attacks. These are significantly more likely AND more serious than UTF-16 misuses. And you still have not proven anything about SMP characters being a problem, but only that code can be broken. Broken code is still broken code, no matter what your actual brokenness. Roy Smith (and many other links Ive cited) prove exactly that - an SMP character broke the code. Note: I have no objection to people supporting full unicode 7. Im just saying it may be significantly harder than just Use python3 and you are done -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 3/6/2015 11:20 AM, Rustom Mody wrote: ===== pp = "💩" print (pp) ===== Try open it in idle3 and you get (at least I get): $ idle3 ff.py Traceback (most recent call last): File "/usr/bin/idle3", line 5, in <module> main() File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main if flist.open(filename) is None: File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open edit = self.EditorWindow(self, filename, key) File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__ EditorWindow.__init__(self, *args) File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__ if io.loadfile(filename): File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile self.text.insert(1.0, chars) File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert self.top.insert(index, chars, tags) File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert self.addcmd(InsertCommand(index, chars, tags)) File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd cmd.do(self.delegate) File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do text.insert(self.index1, self.chars, self.tags) File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert self.delegate.insert(index, chars, tags) File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__ return self.tk_call(self.orig_and_operation + args) _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl So who/what is broken? tcl The possible workaround is for Idle to translate the char to \U0001f4a9 (10 chars) before sending it to tk. But some perspective. In the console interpreter: >>> print("\U0001f4a9") Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in position 0: character maps to <undefined> So what is broken? Windows Command Prompt. More perspective. tk/Idle *will* print *something* for every BMP char. Command Prompt will not. It does not even do ucs-2 correctly. So which is more broken? Windows Command Prompt. Who has perhaps 1,000,000 times more resources, Microsoft? or the tcl/tk group? I think we all know. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
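(The translate-before-sending-to-tk workaround could look roughly like this; a sketch only, not what IDLE actually does.)

def escape_astral(s):
    # Replace anything outside the BMP with its 10-character \UXXXXXXXX escape,
    # which Tcl/Tk 8.x can at least display as plain text.
    return ''.join(c if ord(c) <= 0xFFFF else '\\U%08x' % ord(c) for c in s)

assert escape_astral('x\U0001f4a9y') == 'x\\U0001f4a9y'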
Re: Newbie question about text encoding
On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote: I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. You would have to program in artificial restrictions that otherwise don't exist. UTF-8 is already restricted from representing values above 0x10FFFF, whereas UTF-8 can naturally represent values up to 0x1FFFFF in four bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If anything, the BMP represents a natural boundary, since it coincides with values that can be represented in three bytes. Likewise, UTF-32 can obviously represent values up to 0xFFFFFFFF. You're programming in artificial restrictions either way, it's just a question of what those restrictions are. -- https://mail.python.org/mailman/listinfo/python-list
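(Those boundaries, checked against Python's RFC 3629 UTF-8 codec, which caps the range at U+10FFFF and four bytes; the five- and six-byte forms are exactly the ones the RFC removed.)

assert len(chr(0x7F).encode('utf-8')) == 1
assert len(chr(0x7FF).encode('utf-8')) == 2
assert len(chr(0xFFFF).encode('utf-8')) == 3       # the BMP boundary
assert len(chr(0x10FFFF).encode('utf-8')) == 4     # the Unicode ceiling
# chr(0x110000) raises ValueError, and the old five/six-byte sequences
# (reaching 0x3FFFFFF and 0x7FFFFFFF) are no longer legal UTF-8 at all.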
Re: Newbie question about text encoding
random...@fastmail.us wrote: On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote: I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. You would have to program in artificial restrictions that otherwise don't exist. UTF-8 is already restricted from representing values above 0x10FFFF, whereas UTF-8 can naturally represent values up to 0x1FFFFF in four bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If anything, the BMP represents a natural boundary, since it coincides with values that can be represented in three bytes. Likewise, UTF-32 can obviously represent values up to 0xFFFFFFFF. You're programming in artificial restrictions either way, it's just a question of what those restrictions are. Good points, but they don't greatly change my conclusion. If you are implementing UTF-8 or UTF-32, it is no harder to deal with code points in the SMP than those in the BMP. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
Rustom Mody wrote: On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the the two-way classification - ASCII - Unicode is less useful and accurate than the 3-way - ASCII - BMP - Unicode How is that more useful? Aside from storage optimizations (in which the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is not significantly different from the rest of Unicode. Sorry... Dont understand. Chris is suggesting that going from BMP to all of Unicode is not the hard part. Going from ASCII to the BMP part of Unicode is the hard part. If you can do that, you can go the rest of the way easily. I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. You would have to program in artificial restrictions that otherwise don't exist. UTF-16 is different, and that's probably why you think supporting all of Unicode is hard. With UTF-16, there really is an obvious distinction between the BMP and the SMP: that's where you jump from a single 2-byte unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8 or UTF-32: - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you support the SMP or not doesn't change the fact that you have to deal with multi-byte characters. - In UTF-32, everything is fixed-width whether it is in the BMP or not. In both cases, supporting the SMPs is no harder than supporting the BMP. It's only UTF-16 that makes the SMP seem hard. Conclusion: faulty implementations of UTF-16 which incorrectly handle surrogate pairs should be replaced by non-faulty implementations, or changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be upgraded. Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new standard that is just like obsolete Unicode version 1. Unicode version 1 is obsolete for a reason. 16 bits is not enough for even existing languages, let alone all the code points and characters that are used in human communication. Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why do you keep talking about 7.0 as if it's a recent change? It is 2015 as of now. 7.0 is the current standard. The need for the adjective 'current' should be pondered upon. What's your point? The UTF encodings have not changed since they were first introduced. They have been stable for at least twenty years: UTF-8 has existed since 1993, and UTF-16 since 1996. Since version 2.0 of Unicode in 1996, the standard has made stability guarantees that no code points will be renamed or removed. Consequently, there has only been one version which removed characters, version 1.1. Since then, new versions of the standard have only added characters, never moved, renamed or deleted them. http://unicode.org/policies/stability_policy.html Some highlights in Unicode history: Unicode 1.0 (1991): initial version, defined 7161 code points. In January 1993, Rob Pike and Ken Thompson announced the design and working implementation of the UTF-8 encoding. 1.1 (1993): defined 34233 characters, finalised Han Unification. Removed some characters from the 1.0 set. This is the first and only time any code points have been removed. 
2.0 (1996): First version to include code points in the Supplementary Multilingual Planes. Defined 38950 code points. Introduced the UTF-16 encoding. 3.1 (2001): Defined 94205 code points, including 42711 additional Han ideographs, bringing the total number of CJK code points alone to 71793, too many to fit in 16 bits. 2006: The People's Republic Of China mandates support for the GB-18030 character set for all software products sold in the PRC. GB-18030 supports the entire Unicode range, include the SMPs. Since this date, all software sold in China must support the SMPs. 6.0 (2010): The first emoji or emoticons were added to Unicode. 7.0 (2014): 113021 code points defined in total. In practice, standards change. However if a standard changes so frequently that that users have to play catching cook and keep asking: Which version? they are justified in asking Are the standard-makers doing due diligence? Since Unicode has stability guarantees, and the encodings have not changed in twenty years and will not change in the future, this argument is bogus. Updating to a new version of the standard means, to a first approximation, merely allocating some new code points which had previously been undefined but are now defined. (Code points can be flagged deprecated, but they will never be removed.) -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody rustompm...@gmail.com wrote: My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use characters and some only (the equivalent of) what we call words. Or you can skip the blame-game and simply note the fact that large segments of extant code-bases are currently in bug-prone or plain buggy state. For most of the 1990s, I was writing code in REXX, on OS/2. An even earlier adopter, REXX didn't have Unicode support _at all_, but instead had facilities for working with DBCS strings. You can't get everything right AND be the first to produce anything. Python didn't make Unicode strings the default until 3.0, but that's not Unicode's fault. This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. Here is Roy's Smith post that first started me thinking that something may be wrong with SMP https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with SMP characters - System made with python and mysql. SMP works with python, breaks mysql. So whole system broke due to those 4 in 200,000,000 records I know enough (or not enough) of unicode to be chary of statistical conclusions from the above. My conclusion is essentially an 'existence-proof': Hang on hang on. Why are you blaming Python or SMP characters for this? The problem here is MySQL, which doesn't adequately cope with the full Unicode range. (Or, didn't then, or doesn't with its default settings. I believe you can configure current versions of MySQL to work correctly, though I haven't actually checked. PostgreSQL gets it right, that's good enough for me.) SMP-chars can break systems. The breakage is costly-fied by the combination - layman statistical assumptions - BMP → SMP exercises different code-paths Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode consortium for allocating something at codepoint 0, or the code that can't cope with a perfectly normal character? You could also choose do with astral crap (Roy's words) what we all do with crap -- throw it out as early as possible. There's only one character that fits that description, and that's 1F4A9. Everything else is just astral characters, and you shouldn't have any difficulties with them. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
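For what it's worth, the MySQL trap described above is usually the legacy 'utf8' column charset, which stores at most three bytes per character and therefore only covers the BMP; the full-range charset is 'utf8mb4'. Below is a minimal sketch of the "flag it early" option in Python; has_astral is a hypothetical helper name, not anything from the thread:

    def has_astral(s):
        """True if s contains any character outside the BMP (code point > U+FFFF)."""
        return any(ord(c) > 0xFFFF for c in s)

    record = 'name with emoji \U0001F600'
    if has_astral(record):
        # Decide up front: reject, strip, or make sure the table, column and
        # connection are all configured for utf8mb4 before inserting.
        print('record contains supplementary-plane characters')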
Re: Newbie question about text encoding
On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the the two-way classification - ASCII - Unicode is less useful and accurate than the 3-way - ASCII - BMP - Unicode How is that more useful? Aside from storage optimizations (in which the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is not significantly different from the rest of Unicode. Sorry... Dont understand. Chris is suggesting that going from BMP to all of Unicode is not the hard part. Going from ASCII to the BMP part of Unicode is the hard part. If you can do that, you can go the rest of the way easily. Depends where the going is starting from. I specifically names Java, Javascript, Windows... among others. Here's some quotes from the supplementary chars doc of Java http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html | Supplementary characters are characters in the Unicode standard whose code | points are above U+, and which therefore cannot be described as single | 16-bit entities such as the char data type in the Java programming language. | Such characters are generally rare, but some are used, for example, as part | of Chinese and Japanese personal names, and so support for them is commonly | required for government applications in East Asian countries... | The introduction of supplementary characters unfortunately makes the | character model quite a bit more complicated. | Unicode was originally designed as a fixed-width 16-bit character encoding. | The primitive data type char in the Java programming language was intended to | take advantage of this design by providing a simple data type that could hold | any character Version 5.0 of the J2SE is required to support version 4.0 | of the Unicode standard, so it has to support supplementary characters. My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use characters and some only (the equivalent of) what we call words. Or you can skip the blame-game and simply note the fact that large segments of extant code-bases are currently in bug-prone or plain buggy state. This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. You would have to program in artificial restrictions that otherwise don't exist. Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0 irrelevant. Large segments of the UTF-16 is different, and that's probably why you think supporting all of Unicode is hard. With UTF-16, there really is an obvious distinction between the BMP and the SMP: that's where you jump from a single 2-byte unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8 or UTF-32: - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you support the SMP or not doesn't change the fact that you have to deal with multi-byte characters. - In UTF-32, everything is fixed-width whether it is in the BMP or not. In both cases, supporting the SMPs is no harder than supporting the BMP. 
It's only UTF-16 that makes the SMP seem hard. Conclusion: faulty implementations of UTF-16 which incorrectly handle surrogate pairs should be replaced by non-faulty implementations, or changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be upgraded. Imagine for a moment a thought experiment -- we are not on a python but a java forum and please rewrite the above para. Are you addressing the vanilla java programmer? Language implementer? Designer? The Java-funders -- earlier Sun, now Oracle? Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new standard that is just like obsolete Unicode version 1. Unicode version 1 is obsolete for a reason. 16 bits is not enough for even existing languages, let alone all the code points and characters that are used in human communication. Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why do you keep talking about 7.0 as if it's a recent change? It is 2015 as of now. 7.0 is the current standard. The need for the adjective 'current' should be pondered upon. What's your point? The UTF encodings have not changed since they were first introduced. They have been stable for at least twenty years: UTF-8 has
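The "pair of 2-byte units" that Java's 16-bit char cannot hold singly is plain arithmetic; here is a sketch of the standard UTF-16 formula (RFC 2781), cross-checked against Python's own codec:

    def to_surrogate_pair(cp):
        """Split a supplementary-plane code point into a UTF-16 high/low surrogate pair."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    high, low = to_surrogate_pair(0x1F4A9)
    print(hex(high), hex(low))                     # 0xd83d 0xdca9
    print('\U0001F4A9'.encode('utf-16-be').hex())  # d83ddca9 -- the same two units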
Re: Newbie question about text encoding
On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody rustompm...@gmail.com wrote: What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish'] Re footnote #4: ½ is a single character for compatibility reasons. ⅟₁₀₀ doesn't need to be a single character, because there are countably infinite vulgar fractions and only 0x110000 Unicode characters. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
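Both halves of that remark are easy to verify interactively: sys.maxunicode is the last of the 0x110000 code points, and the compatibility fractions carry their numeric value as a character property. A small sketch:

    import sys, unicodedata

    print(hex(sys.maxunicode))            # 0x10ffff, i.e. 0x110000 code points in all
    print(unicodedata.numeric('\u00bd'))  # 0.5  (U+00BD VULGAR FRACTION ONE HALF)
    print(unicodedata.name('\u215f'))     # FRACTION NUMERATOR ONE, the standalone 1/x piece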
Re: Newbie question about text encoding
On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII: http://blog.languager.org/2015/02/universal-unicode.html I think that the main point of the post, that many Unicode chars are truly planetary rather than just national/regional, is excellent. snipped You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring and seems contrary to your main point. Ok Done References to gibberish removed from http://blog.languager.org/2015/02/universal-unicode.html What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish'] -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 3/3/2015 1:03 PM, Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring and seems contrary to your main point. Ok Done References to gibberish removed from http://blog.languager.org/2015/02/universal-unicode.html What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish'] I agree with both. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the two-way classification - ASCII - Unicode is less useful and accurate than the 3-way - ASCII - BMP - Unicode How is that more useful? Aside from storage optimizations (in which the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is not significantly different from the rest of Unicode. Sorry... Don't understand. Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why do you keep talking about 7.0 as if it's a recent change? It is 2015 as of now. 7.0 is the current standard. The need for the adjective 'current' should be pondered upon. In practice, standards change. However if a standard changes so frequently that users have to play catch-up and keep asking: Which version? they are justified in asking Are the standard-makers doing due diligence? -- https://mail.python.org/mailman/listinfo/python-list
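"Which version?" has a concrete answer on any given Python: the unicodedata module reports which Unicode Character Database it was built with, and whether a particular code point is defined yet is just a lookup. A rough sketch; is_assigned is a hypothetical helper, and strictly speaking some assigned code points (controls, for instance) have no name:

    import unicodedata

    print(unicodedata.unidata_version)   # the bundled UCD version, e.g. '6.3.0' on Python 3.4

    def is_assigned(cp):
        """Rough check: does this code point have a character name in this UCD?"""
        return bool(unicodedata.name(chr(cp), ''))

    print(is_assigned(0x20AC))   # True -- EURO SIGN
    print(is_assigned(0x1F600))  # True on any recent build (U+1F600 was added in Unicode 6.1)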
Re: Newbie question about text encoding
On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote: What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish'] Re footnote #4: ½ is a single character for compatibility reasons. ⅟₁₀₀ ... ^^^ Neat Thanks [And figured out some of the quopri module along the way] -- https://mail.python.org/mailman/listinfo/python-list
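For anyone else poking at the same thing: quoted-printable is how such characters travel through mail, and the quopri module shows the escaping directly. A small sketch (the printed output is shown approximately):

    import quopri

    raw = '\u00bd and \u215f'.encode('utf-8')
    qp = quopri.encodestring(raw)
    print(qp)                                       # roughly b'=C2=BD and =E2=85=9F'
    print(quopri.decodestring(qp).decode('utf-8'))  # round-trips to the original text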
Re: Newbie question about text encoding
Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII: http://blog.languager.org/2015/02/universal-unicode.html I think that the main point of the post, that many Unicode chars are truly planetary rather than just national/regional, is excellent. snipped You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring and seems contrary to your main point. Ok Done References to gibberish removed from http://blog.languager.org/2015/02/universal-unicode.html I consider it unethical to make semantic changes to a published work in place without acknowledgement. Fixing minor typos or spelling errors, or dead links, is okay. But any edit that changes the meaning should be commented on, either by an explicit note on the page itself, or by striking out the previous content and inserting the new. As for the content of the essay, it is currently rather unfocused. It appears to be more of a list of here are some Unicode characters I think are interesting, divided into subgroups, oh and here are some I personally don't have any use for, which makes them silly than any sort of discussion about the universality of Unicode. That makes it rather idiosyncratic and parochial. Why should obscure maths symbols be given more importance than obscure historical languages? I think that the universality of Unicode could be explained in a single sentence: It is the aim of Unicode to be the one character set anyone needs to represent every character, ideogram or symbol (but not necessarily distinct glyph) from any existing or historical human language. I can expand on that, but in a nutshell that is it. You state: APL and Z Notation are two notable languages APL is a programming language and Z a specification language that did not tie themselves down to a restricted charset ... but I don't think that is correct. I'm pretty sure that neither APL nor Z allowed you to define new characters. They might not have used ASCII alone, but they still had a restricted character set. It was merely less restricted than ASCII. You make a comment about Cobol's relative unpopularity, but (1) Cobol doesn't require you to write out numbers as English words, and (2) Cobol is still used, there are uncounted billions of lines of Cobol code being used, and if the number of Cobol programmers is less now than it was 16 years ago, there are still a lot of them. Academics and FOSS programmers don't think much of Cobol, but it has to count as one of the most amazing success stories in the field of programming languages, despite its lousy design. You list ideographs such as Cuneiform under Icons. They are not icons. They are a mixture of symbols used for consonants, syllables, and logophonetic, consonantal alphabetic and syllabic signs. That sits them firmly in the same categories as modern languages with consonants, ideogram languages like Chinese, and syllabary languages like Cheyenne. Just because native readers of Cuneiform are all dead doesn't make Cuneiform unimportant. There are probably more people who need to write Cuneiform than people who need to write APL source code. 
You make a comment: To me – a unicode-layman – it looks unprofessional… Billions of computing devices world over, each having billions of storage words having their storage wasted on blocks such as these?? But that is nonsense, and it contradicts your earlier quoting of Dave Angel. Why are you so worried about an (illusionary) minor optimization? Whether code points are allocated or not doesn't affect how much space they take up. There are millions of unused Unicode code points today. If they are allocated tomorrow, the space your documents take up will not increase one byte. Allocating code points to Cuneiform has not increased the space needed by Unicode at all. Two bytes alone is not enough for even existing human languages (thanks China). For hardware related reasons, it is faster and more efficient to use four bytes than three, so the obvious and dumb (in the simplest thing which will work) way to store Unicode is UTF-32, which takes a full four bytes per code point, regardless of whether there are 65537 code points or 1114112. That makes it less expensive than floating point numbers, which take eight. Would you like to argue that floating point doubles are unprofessional and wasteful? As Dave pointed out, and you apparently agreed with him enough to quote him TWICE (once in each of two blog posts), history of computing is full of premature optimizations for space. (In fact, some of these may have been justified by the technical limitations of the day.) Technically Unicode is also limited, but it is limited to over one million code
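The storage point is easy to make concrete: in UTF-32 the cost is four bytes per code point no matter which plane a character comes from, so allocating Cuneiform (or anything else) changes nothing for documents that don't use it. A sketch with arbitrary characters:

    old = 'abcd'                  # four ASCII characters
    new = 'a\u20ac\U00013000b'    # four characters mixing ASCII, BMP and SMP
                                  # (U+13000 is an Egyptian hieroglyph)

    print(len(old), len(old.encode('utf-32-le')))   # 4 characters, 16 bytes
    print(len(new), len(new.encode('utf-32-le')))   # 4 characters, 16 bytes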
Re: Newbie question about text encoding
On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote: On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII: http://blog.languager.org/2015/02/universal-unicode.html I think that the main point of the post, that many Unicode chars are truly planetary rather than just national/regional, is excellent. snipped You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring and seems contrary to your main point. Ok Done References to gibberish removed from http://blog.languager.org/2015/02/universal-unicode.html I consider it unethical to make semantic changes to a published work in place without acknowledgement. Fixing minor typos or spelling errors, or dead links, is okay. But any edit that changes the meaning should be commented on, either by an explicit note on the page itself, or by striking out the previous content and inserting the new. Dunno What you are grumping about… Anyway the attribution is made more explicit – footnote 5 in http://blog.languager.org/2015/03/whimsical-unicode.html. Note Terry Reedy's post who mainly objected was already acked earlier. Ive just added one more ack¹ And JFTR the 'publication' (O how archaic!) is the whole blog not a single page just as it is for any other dead-tree publication. As for the content of the essay, it is currently rather unfocused. True. It appears to be more of a list of here are some Unicode characters I think are interesting, divided into subgroups, oh and here are some I personally don't have any use for, which makes them silly than any sort of discussion about the universality of Unicode. That makes it rather idiosyncratic and parochial. Why should obscure maths symbols be given more importance than obscure historical languages? Idiosyncratic ≠ parochial I think that the universality of Unicode could be explained in a single sentence: It is the aim of Unicode to be the one character set anyone needs to represent every character, ideogram or symbol (but not necessarily distinct glyph) from any existing or historical human language. I can expand on that, but in a nutshell that is it. You state: APL and Z Notation are two notable languages APL is a programming language and Z a specification language that did not tie themselves down to a restricted charset ... Tsk Tsk – dihonest snipping. I wrote | APL and Z Notation are two notable languages APL is a programming language | and Z a specification language that did not tie themselves down to a | restricted charset even in the day that ASCII ruled. so its clear that the restricted applies to ASCII You list ideographs such as Cuneiform under Icons. They are not icons. They are a mixture of symbols used for consonants, syllables, and logophonetic, consonantal alphabetic and syllabic signs. That sits them firmly in the same categories as modern languages with consonants, ideogram languages like Chinese, and syllabary languages like Cheyenne. Ok changed to iconic. Obviously 2-3 millenia ago, when people spoke hieroglyphs or cuneiform they were languages. 
In 2015 when someone sees them and recognizes them, they are 'those things that Sumerians/Egyptians wrote' No one except a rare expert knows those languages Just because native readers of Cuneiform are all dead doesn't make Cuneiform unimportant. There are probably more people who need to write Cuneiform than people who need to write APL source code. You make a comment: To me – a unicode-layman – it looks unprofessional… Billions of computing devices world over, each having billions of storage words having their storage wasted on blocks such as these?? But that is nonsense, and it contradicts your earlier quoting of Dave Angel. Why are you so worried about an (illusionary) minor optimization? 2 4 as far as I am concerned. [If you disagree one man's illusionary is another's waking] Whether code points are allocated or not doesn't affect how much space they take up. There are millions of unused Unicode code points today. If they are allocated tomorrow, the space your documents take up will not increase one byte. Allocating code points to Cuneiform has not increased the space needed by Unicode at all. Two bytes alone is not enough for even existing human languages (thanks China). For hardware related reasons, it is faster and more efficient to use four bytes than three, so the obvious and dumb (in the
Re: Newbie question about text encoding
On Wednesday, March 4, 2015 at 12:07:06 AM UTC+5:30, jmf wrote: On Tuesday, March 3, 2015 at 19:04:06 UTC+1, Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII: http://blog.languager.org/2015/02/universal-unicode.html I think that the main point of the post, that many Unicode chars are truly planetary rather than just national/regional, is excellent. snipped You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring and seems contrary to your main point. Ok Done References to gibberish removed from http://blog.languager.org/2015/02/universal-unicode.html What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish'] Emoji and Dingbats are now part of Unicode. They should be considered as well as a 1 or an a or a mathematical alpha. So, there is nothing special to say about them. jmf Maybe you missed this section: http://blog.languager.org/2015/03/whimsical-unicode.html#half-assed It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the two-way classification - ASCII - Unicode is less useful and accurate than the 3-way - ASCII - BMP - Unicode Personally I would be pleased if 𝛌 were used for the math-lambda and λ left alone for Greek-speaking users' identifiers. However one should draw a line between personal preferences and a universal(izable) standard. As of now, λ works in blogger whereas 𝛌 breaks blogger -- gets replaced by �. Similar breakages are current in Java, Javascript, Emacs, Mysql, Idle and Windows, various fonts etc etc. [Only one of these is remotely connected with python] So BMP is practical, 7.0 is idealistic. You are free to pick -- https://mail.python.org/mailman/listinfo/python-list
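Incidentally, the mathematical letters carry compatibility decompositions back to the ordinary letters, and Python 3 applies exactly that normalisation (NFKC, per PEP 3131) to identifiers, so the two lambdas already collapse to one name in source code. A sketch:

    import unicodedata

    math_lambda = '\U0001D6CC'   # MATHEMATICAL BOLD SMALL LAMDA, an SMP code point
    greek_lambda = '\u03bb'      # GREEK SMALL LETTER LAMDA, in the BMP

    print(unicodedata.normalize('NFKC', math_lambda) == greek_lambda)   # True
    # Since identifiers are NFKC-normalised, both spellings would refer to the
    # same variable in Python 3 source.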
Re: Newbie question about text encoding
On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII: http://blog.languager.org/2015/02/universal-unicode.html I think that the main point of the post, that many Unicode chars are truly planetary rather than just national/regional, is excellent. snipped You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring and seems contrary to your main point. Ok Done References to gibberish removed from http://blog.languager.org/2015/02/universal-unicode.html I consider it unethical to make semantic changes to a published work in place without acknowledgement. Fixing minor typos or spelling errors, or dead links, is okay. But any edit that changes the meaning should be commented on, either by an explicit note on the page itself, or by striking out the previous content and inserting the new. Dunno What you are grumping about… Anyway the attribution is made more explicit – footnote 5 in http://blog.languager.org/2015/03/whimsical-unicode.html. Note Terry Reedy's post who mainly objected was already acked earlier. Ive just added one more ack¹ And JFTR the 'publication' (O how archaic!) is the whole blog not a single page just as it is for any other dead-tree publication. As for the content of the essay, it is currently rather unfocused. True. It appears to be more of a list of here are some Unicode characters I think are interesting, divided into subgroups, oh and here are some I personally don't have any use for, which makes them silly than any sort of discussion about the universality of Unicode. That makes it rather idiosyncratic and parochial. Why should obscure maths symbols be given more importance than obscure historical languages? Idiosyncratic ≠ parochial I think that the universality of Unicode could be explained in a single sentence: It is the aim of Unicode to be the one character set anyone needs to represent every character, ideogram or symbol (but not necessarily distinct glyph) from any existing or historical human language. I can expand on that, but in a nutshell that is it. You state: APL and Z Notation are two notable languages APL is a programming language and Z a specification language that did not tie themselves down to a restricted charset ... Tsk Tsk – dihonest snipping. I wrote | APL and Z Notation are two notable languages APL is a programming language | and Z a specification language that did not tie themselves down to a | restricted charset even in the day that ASCII ruled. so its clear that the restricted applies to ASCII You list ideographs such as Cuneiform under Icons. They are not icons. They are a mixture of symbols used for consonants, syllables, and logophonetic, consonantal alphabetic and syllabic signs. That sits them firmly in the same categories as modern languages with consonants, ideogram languages like Chinese, and syllabary languages like Cheyenne. Ok changed to iconic. Obviously 2-3 millenia ago, when people spoke hieroglyphs or cuneiform they were languages. 
In 2015 when someone sees them and recognizes them, they are 'those things that Sumerians/Egyptians wrote' No one except a rare expert knows those languages Just because native readers of Cuneiform are all dead doesn't make Cuneiform unimportant. There are probably more people who need to write Cuneiform than people who need to write APL source code. You make a comment: To me – a unicode-layman – it looks unprofessional… Billions of computing devices world over, each having billions of storage words having their storage wasted on blocks such as these?? But that is nonsense, and it contradicts your earlier quoting of Dave Angel. Why are you so worried about an (illusionary) minor optimization? 2 4 as far as I am concerned. [If you disagree one man's illusionary is another's waking] Whether code points are allocated or not doesn't affect how much space they take up. There are millions of unused Unicode code points today. If they are allocated tomorrow, the space your documents take up will not increase one byte. Allocating code points to Cuneiform has not increased the space needed by Unicode at all. Two bytes alone is not enough for even existing human languages (thanks China). For hardware related reasons, it is faster and more efficient to use four bytes than three, so the obvious and dumb (in the simplest thing which will work) way to store Unicode is UTF-32, which takes a full four bytes per code point, regardless of whether there are 65537 code points or 1114112. That makes it less
Re: Newbie question about text encoding
On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: It is easy to mock what is not important to you. I daresay kids adding emoji to their 10 character tweets would mock all the useless maths symbols in Unicode too. Definitely! Who ever sings do you wanna build an integral sign? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody rustompm...@gmail.com wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the the two-way classification - ASCII - Unicode is less useful and accurate than the 3-way - ASCII - BMP - Unicode How is that more useful? Aside from storage optimizations (in which the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is not significantly different from the rest of Unicode. Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why do you keep talking about 7.0 as if it's a recent change? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
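Those are exactly the breaks CPython 3.3+ uses internally (PEP 393, the flexible string representation): a str is stored with 1, 2 or 4 bytes per character depending on the widest character it contains, which sys.getsizeof makes visible. A sketch; the exact byte counts vary by build:

    import sys

    latin  = 'a' * 1000           # widest character fits in Latin-1: 1 byte each
    bmp    = '\u0101' * 1000      # needs 2 bytes per character (UCS-2 range)
    astral = '\U00010101' * 1000  # needs 4 bytes per character (UCS-4 range)

    for s in (latin, bmp, astral):
        print(len(s), sys.getsizeof(s))   # same length, roughly 1x / 2x / 4x the storage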
Re: Newbie question about text encoding
On 02/27/2015 06:54 AM, Steven D'Aprano wrote: Dave Angel wrote: On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: (Although I believe Seymour Cray was quoted as saying that virtual memory is a crock, because you can't fake what you ain't got.) If I recall correctly, disk access is about 10,000 times slower than RAM, so virtual memory is *at least* that much slower than real memory. It's so much more complicated than that, that I hardly know where to start. [snip technical details] As interesting as they were, none of those details will make swap faster, hence my comment that virtual memory is *at least* 10,000 times slower than RAM. The term virtual memory is used for many aspects of the modern memory architecture. But I presume you're using it in the sense of running in a swapfile as opposed to running in physical RAM. Yes, a page fault takes on the order of 10,000 times as long as an access to a location in L1 cache. I suspect it's a lot smaller though if the swapfile is on an SSD drive. The first byte is that slow. But once the fault is resolved, the nearby bytes are in physical memory, and some of them are in L3, L2, and L1. So you're not running in the swapfile any more. And even when you run off the end of the page, fetching the sequentially adjacent page from a hard disk is much faster. And if the disk has well designed buffering, faster yet. The OS tries pretty hard to keep the swapfile unfragmented. The trick is to minimize the number of page faults, especially to random locations. If you're getting lots of them, it's called thrashing. There are tools to help with that. To minimize page faults on code, linking with a good working-set-tuner can help, though I don't hear of people bothering these days. To minimize page faults on data, choosing one's algorithm carefully can help. For example, in scanning through a typical matrix, row order might be adjacent locations, while column order might be scattered. Not really much different than reading a text file. If you can arrange to process it a line at a time, rather than reading the whole file into memory, you generally minimize your round-trips to disk. And if you need to randomly access it, it's quite likely more efficient to memory map it, in which case it temporarily becomes part of the swapfile system. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list
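A minimal sketch of the memory-mapping option mentioned at the end, assuming a large existing file; the file name and offsets are placeholders. Pages are faulted in on demand, so a slice far into the file doesn't read everything before it:

    import mmap

    with open('big.dat', 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            chunk = m[4096:4160]   # only the pages backing this range get touched
            print(len(chunk))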
Re: Newbie question about text encoding
On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications. examples libre office, I rarely need the mail merge function, the word count and many other features that could be added into the running application on demand rather than all at once. Downside of that is twofold: firstly the complexity that I already mentioned, and secondly you pay the startup cost on first usage. So you might get into the program a bit faster, but as soon as you go to any feature you didn't already hit this session, the program pauses for a bit and loads it. Sometimes startup cost is the best time to do this sort of thing. If the modules are small enough this may not be noticeable but yes I do accept there may be delays on first usage. As to the complexity it has been my observation that as the memory footprint available to programmers has increased they have become less and less skilled at writing code. of course my time as a professional programmer was over 20 years ago on 8 bit micro controllers with 8k of ROM (eventually, originally I only had 2k to play with) 128 Bytes (yes bytes!) of RAM so I am very out of date. I now play with python because it is so much less demanding of me which probably makes me just as guilty :-) Of course, there is an easy way to implement exactly what you're asking for: use separate programs for everything, instead of expecting a megantic office suite[1] to do everything for you. Just get yourself a nice simple text editor, then invoke other programs - maybe from a terminal, or maybe from within the editor - to do the rest of the work. A simple disk cache will mean that previously-used programs start up quickly. Libre office was cited as just one example Video editing suites are another that could be used as an example (perhaps more so, does the rendering engine need to be loaded until you start generating the output? a small delay here would be insignificant) ChrisA [1] It's slightly less bloated than the gigantic office suite sold by a top-end software company. -- You don't sew with a fork, so I see no reason to eat with knitting needles. -- Miss Piggy, on eating Chinese Food -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Feb 28, 2015 at 3:45 AM, alister alister.nospam.w...@ntlworld.com wrote: On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications. examples libre office, I rarely need the mail merge function, the word count and may other features that could be added into the running application on demand rather than all at once. Downside of that is twofold: firstly the complexity that I already mentioned, and secondly you pay the startup cost on first usage. So you might get into the program a bit faster, but as soon as you go to any feature you didn't already hit this session, the program pauses for a bit and loads it. Sometimes startup cost is the best time to do this sort of thing. If the modules are small enough this may not be noticeable but yes I do accept there may be delays on first usage. As to the complexity it has been my observation that as the memory footprint available to programmers has increase they have become less less skilled at writing code. Perhaps, but on the other hand, the skill of squeezing code into less memory is being replaced by other skills. We can write code that takes the simple/dumb approach, let it use an entire megabyte of memory, and not care about the cost... and we can write that in an hour, instead of spending a week fiddling with it. Reducing the development cycle time means we can add all sorts of cool features to a program, all while the original end user is still excited about it. (Of course, a comparison between today's World Wide Web and that of the 1990s suggests that these cool features aren't necessarily beneficial, but still, we have the option of foregoing austerity.) Video editing suites are another that could be used as an example (perhaps more so, does the rendering engine need to be loaded until you start generating the output? a small delay here would be insignificant) Hmm, I'm not sure that's actually a big deal, because your *data* will dwarf the code. I can fire up sox and avconv, both fairly large programs, and their code will all sit comfortably in memory; but then they get to work on my data, and suddenly my hard disk is chewing through 91GB of content. Breaking up avconv into a dozen pieces wouldn't make a dent in 91GB. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 2015-02-27, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Dave Angel wrote: On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: (Although I believe Seymour Cray was quoted as saying that virtual memory is a crock, because you can't fake what you ain't got.) If I recall correctly, disk access is about 10,000 times slower than RAM, so virtual memory is *at least* that much slower than real memory. It's so much more complicated than that, that I hardly know where to start. [snip technical details] As interesting as they were, none of those details will make swap faster, hence my comment that virtual memory is *at least* 10,000 times slower than RAM. Nonsense. On all of my machines, virtual memory _is_ RAM almost all of the time. I don't do the type of things that force the usage of swap. -- Grant Edwards grant.b.edwards at gmail.com Yow! ... I want FORTY-TWO TRYNEL FLOATATION SYSTEMS installed within SIX AND A HALF HOURS!!! -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel da...@davea.name wrote: The term virtual memory is used for many aspects of the modern memory architecture. But I presume you're using it in the sense of running in a swapfile as opposed to running in physical RAM. Given that this started with a quote about you can't fake what you ain't got, I would say that, yes, this refers to using hard disk to provide more RAM. If you're trying to use the pagefile/swapfile as if it's more memory (I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!), then yes, these performance considerations are huge. But suppose you need to run a program that's larger than your available RAM. On MS-DOS, sometimes you'd need to work with program overlays (a concept borrowed from older systems, but ones that I never worked on, so I'm going back no further than DOS here). You get a *massive* complexity hit the instant you start using them, whether your program would have been able to fit into memory on some systems or not. Just making it possible to have only part of your code in memory places demands on your code that you, the programmer, have to think about. With virtual memory, though, you just write your code as if it's all in memory, and some of it may, at some times, be on disk. Less code to debug = less time spent debugging. The performance question is largely immaterial (you'll be using the disk either way), but the savings on complexity are tremendous. And then when you do find yourself running on a system with enough RAM? No code changes needed, and full performance. That's where virtual memory shines. It's funny how the world changes, though. Back in the 90s, virtual memory was the key. No home computer ever had enough RAM. Today? A home-grade PC could easily have 16GB... and chances are you don't need all of that. So we go for the opposite optimization: disk caching. Apart from when I rebuild my Audio-Only Frozen project [1] and the caches get completely blasted through, heaps and heaps of my work can be done inside the disk cache. Hey, Sikorsky, got any files anywhere on the hard disk matching *Pastel*.iso case insensitively? *chug chug chug* Nope. Okay. Sikorsky, got any files matching *Pas5*.iso case insensitively? *zip* Yeah, here it is. I didn't tell the first search to hold all that file system data in memory; the hard drive controller managed it all for me, and I got the performance benefit. Same as the above: the main benefit is that this sort of thing requires zero application code complexity. It's all done in a perfectly generic way at a lower level. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 2015-02-27, Grant Edwards invalid@invalid.invalid wrote: On 2015-02-27, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Dave Angel wrote: On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: (Although I believe Seymour Cray was quoted as saying that virtual memory is a crock, because you can't fake what you ain't got.) If I recall correctly, disk access is about 10,000 times slower than RAM, so virtual memory is *at least* that much slower than real memory. It's so much more complicated than that, that I hardly know where to start. [snip technical details] As interesting as they were, none of those details will make swap faster, hence my comment that virtual memory is *at least* 10,000 times slower than RAM. Nonsense. On all of my machines, virtual memory _is_ RAM almost all of the time. I don't do the type of things that force the usage of swap. And on some of the embedded systems I work on, _all_ virtual memory is RAM 100.000% of the time. -- Grant Edwards grant.b.edwards at gmail.com Yow! Don't SANFORIZE me!! -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote: If you're trying to use the pagefile/swapfile as if it's more memory (I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!), then yes, these performance considerations are huge. But suppose you need to run a program that's larger than your available RAM. On MS-DOS, sometimes you'd need to work with program overlays (a concept borrowed from older systems, but ones that I never worked on, so I'm going back no further than DOS here). You get a *massive* complexity hit the instant you start using them, whether your program would have been able to fit into memory on some systems or not. Just making it possible to have only part of your code in memory places demands on your code that you, the programmer, have to think about. With virtual memory, though, you just write your code as if it's all in memory, and some of it may, at some times, be on disk. Less code to debug = less time spent debugging. The performance question is largely immaterial (you'll be using the disk either way), but the savings on complexity are tremendous. And then when you do find yourself running on a system with enough RAM? No code changes needed, and full performance. That's where virtual memory shines. ChrisA I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications. examples libre office, I rarely need the mail merge function, the word count and may other features that could be added into the running application on demand rather than all at once. obviously with large memory virtual mem there is no need to un-install them once loaded. -- Ralph's Observation: It is a mistake to let any mechanical object realise that you are in a hurry. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications. examples libre office, I rarely need the mail merge function, the word count and may other features that could be added into the running application on demand rather than all at once. Downside of that is twofold: firstly the complexity that I already mentioned, and secondly you pay the startup cost on first usage. So you might get into the program a bit faster, but as soon as you go to any feature you didn't already hit this session, the program pauses for a bit and loads it. Sometimes startup cost is the best time to do this sort of thing. Of course, there is an easy way to implement exactly what you're asking for: use separate programs for everything, instead of expecting a megantic office suite[1] to do everything for you. Just get yourself a nice simple text editor, then invoke other programs - maybe from a terminal, or maybe from within the editor - to do the rest of the work. A simple disk cache will mean that previously-used programs start up quickly. ChrisA [1] It's slightly less bloated than the gigantic office suite sold by a top-end software company. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 02/27/2015 09:22 AM, Chris Angelico wrote: On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel da...@davea.name wrote: The term virtual memory is used for many aspects of the modern memory architecture. But I presume you're using it in the sense of running in a swapfile as opposed to running in physical RAM. Given that this started with a quote about you can't fake what you ain't got, I would say that, yes, this refers to using hard disk to provide more RAM. If you're trying to use the pagefile/swapfile as if it's more memory (I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!), then yes, these performance considerations are huge. But suppose you need to run a program that's larger than your available RAM. On MS-DOS, sometimes you'd need to work with program overlays (a concept borrowed from older systems, but ones that I never worked on, so I'm going back no further than DOS here). You get a *massive* complexity hit the instant you start using them, whether your program would have been able to fit into memory on some systems or not. Just making it possible to have only part of your code in memory places demands on your code that you, the programmer, have to think about. With virtual memory, though, you just write your code as if it's all in memory, and some of it may, at some times, be on disk. Less code to debug = less time spent debugging. The performance question is largely immaterial (you'll be using the disk either way), but the savings on complexity are tremendous. And then when you do find yourself running on a system with enough RAM? No code changes needed, and full performance. That's where virtual memory shines. It's funny how the world changes, though. Back in the 90s, virtual memory was the key. No home computer ever had enough RAM. Today? A home-grade PC could easily have 16GB... and chances are you don't need all of that. So we go for the opposite optimization: disk caching. Apart from when I rebuild my Audio-Only Frozen project [1] and the caches get completely blasted through, heaps and heaps of my work can be done inside the disk cache. Hey, Sikorsky, got any files anywhere on the hard disk matching *Pastel*.iso case insensitively? *chug chug chug* Nope. Okay. Sikorsky, got any files matching *Pas5*.iso case insensitively? *zip* Yeah, here it is. I didn't tell the first search to hold all that file system data in memory; the hard drive controller managed it all for me, and I got the performance benefit. Same as the above: the main benefit is that this sort of thing requires zero application code complexity. It's all done in a perfectly generic way at a lower level. In 1973, I did manual swapping to an external 8k ramdisk. It was a box that sat on the floor and contained 8k of core memory (not semiconductor). The memory was non-volatile, so it contained the working copy of my code. Then I built a small swapper that would bring in the set of routines currently needed. My onboard RAM (semiconductor) was 1.5k, which had to hold the swapper, the code, and the data. I was writing a GPS system for shipboard use, and the final version of the code had to fit entirely in EPROM, 2k of it. But debugging EPROM code is a pain, since every small change took half an hour to make new chips. Later, I built my first PC with 512k of RAM, and usually used much of it as a ramdisk, since programs didn't use nearly that amount. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 2015-02-27 16:45, alister wrote: On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications. examples libre office, I rarely need the mail merge function, the word count and may other features that could be added into the running application on demand rather than all at once. Downside of that is twofold: firstly the complexity that I already mentioned, and secondly you pay the startup cost on first usage. So you might get into the program a bit faster, but as soon as you go to any feature you didn't already hit this session, the program pauses for a bit and loads it. Sometimes startup cost is the best time to do this sort of thing. If the modules are small enough this may not be noticeable but yes I do accept there may be delays on first usage. I suppose you could load the basic parts first so that the user can start working, and then load the additional features in the background. As to the complexity it has been my observation that as the memory footprint available to programmers has increase they have become less less skilled at writing code. of course my time as a professional programmer was over 20 years ago on 8 bit micro controllers with 8k of ROM (eventually, original I only had 2k to play with) 128 Bytes (yes bytes!) of RAM so I am very out of date. I now play with python because it is so much less demanding of me which probably makes me just a guilty :-) Of course, there is an easy way to implement exactly what you're asking for: use separate programs for everything, instead of expecting a megantic office suite[1] to do everything for you. Just get yourself a nice simple text editor, then invoke other programs - maybe from a terminal, or maybe from within the editor - to do the rest of the work. A simple disk cache will mean that previously-used programs start up quickly. Libre office was sighted as just one example Video editing suites are another that could be used as an example (perhaps more so, does the rendering engine need to be loaded until you start generating the output? a small delay here would be insignificant) ChrisA [1] It's slightly less bloated than the gigantic office suite sold by a top-end software company. -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On 02/27/2015 11:00 AM, alister wrote: On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote: If you're trying to use the pagefile/swapfile as if it's more memory (I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!), then yes, these performance considerations are huge. But suppose you need to run a program that's larger than your available RAM. On MS-DOS, sometimes you'd need to work with program overlays (a concept borrowed from older systems, but ones that I never worked on, so I'm going back no further than DOS here). You get a *massive* complexity hit the instant you start using them, whether your program would have been able to fit into memory on some systems or not. Just making it possible to have only part of your code in memory places demands on your code that you, the programmer, have to think about. With virtual memory, though, you just write your code as if it's all in memory, and some of it may, at some times, be on disk. Less code to debug = less time spent debugging. The performance question is largely immaterial (you'll be using the disk either way), but the savings on complexity are tremendous. And then when you do find yourself running on a system with enough RAM? No code changes needed, and full performance. That's where virtual memory shines. ChrisA I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications. examples libre office, I rarely need the mail merge function, the word count and may other features that could be added into the running application on demand rather than all at once. obviously with large memory virtual mem there is no need to un-install them once loaded. I can't say how Linux handles it (I'd like to know, but haven't needed to yet), but in Windows (NT, XP, etc), a DLL is not loaded, but rather mapped. And it's not copied into the swapfile, it's mapped directly from the DLL. The mapping mode is copy-on-write which means that read=only portions are swapped directly from the DLL, on first usage, while read-write portions (eg. static/global variables, relocation modifications) are copied on first use to the swap file. I presume EXE's are done the same way, but never had a need to know. If that's the case on the architectures you're talking about, then the problem of slow loading is not triggered by the memory usage, but by lots of initialization code. THAT's what should be deferred for seldom-used portions of code. The main point of a working-set-tuner is to group sections of code together that are likely to be used together. To take an extreme case, all the fatal exception handlers should be positioned adjacent to each other in linear memory, as it's unlikely that any of them will be needed, and the code takes up no time or space in physical memory. Also (in Windows), a DLL can be pre-relocated, so that it has a preferred address to be loaded into memory. If that memory is available when it gets loaded (actually mapped), then no relocation needs to happen, which saves time and swap space. In the X86 architecture, most code is self-relocating, everything is relative. But references to other DLL's and jump tables were absolute, so they needed to be relocated at load time, when final locations were nailed down. 
Perhaps the authors of bloated applications have forgotten how to do these, as the defaults in the linker puts all DLL's in the same location, meaning all but the first will need relocating. But system DLL's are (were) each given unique addresses. On one large project, I added the build step of assigning these base addresses. Each DLL had to start on a 64k boundary, and I reserved some fractional extra space between them in case one would grow. Then every few months, we double-checked that they didn't overlap, and if necessary adjusted the start addresses. We didn't just automatically assign closest addresses, because frequently some of the DLL's would be updated independently of the others. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Sat, Feb 28, 2015 at 7:52 AM, Dave Angel da...@davea.name wrote: If that's the case on the architectures you're talking about, then the problem of slow loading is not triggered by the memory usage, but by lots of initialization code. THAT's what should be deferred for seldom-used portions of code. s/should/can/ It's still not a clear case of should, as it's all a big pile of trade-offs. A few weeks ago I made a very deliberate change to a process to force some code to get loaded and initialized earlier, to prevent an unexpected (and thus surprising) slowdown on first use. (It was, in fact, a Python 'import' statement, so all I had to do was add a dummy import in the main module - with, of course, a comment making it clear that this was necessary, even though the name wasn't used.) But yes, seldom-used code can definitely have its initialization deferred if you need to speed up startup. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Newbie question about text encoding
On Fri, 27 Feb 2015 19:14:00 +0000, MRAB wrote: I suppose you could load the basic parts first so that the user can start working, and then load the additional features in the background. Quite possible; my opinion on this is very fluid. It may work for some applications, it probably wouldn't for others. With python it is generally considered good practice to import all modules at the start of a program, but there are valid cases for only importing a module if actually needed. -- Some people have parts that are so private they themselves have no knowledge of them. -- https://mail.python.org/mailman/listinfo/python-list
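A deferred import is just an import statement inside the function that needs the module: the loading cost moves from program startup to first use, and sys.modules makes every later call cheap. A sketch; word_count is a made-up example feature:

    def word_count(path):
        import re   # deferred: 're' is only loaded the first time this feature runs
        with open(path, encoding='utf-8') as f:
            return sum(len(re.findall(r'\w+', line)) for line in f)

    # The opposite trade-off, as in the anecdote about forcing an early import,
    # is a deliberate top-level import whose only job is to pay the cost at startup.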