Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 19:48:06, Paul Rubin wrote:
> Terry Reedy writes:
>> py> s = chr(0xffff + 1)
>> py> a, b = s
> That looks like a 3.2- narrow build. Such builds treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but a compromise design that goes back about a
> decade to when unicode was added to Python.

Actually, this compromise design was new in 3.0. In 2.x, unicode strings were sequences of code points. Narrow builds rejected any code points > 0xFFFF:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = unichr(0xffff + 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

-- HansM
-- http://mail.python.org/mailman/listinfo/python-list
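[For contrast with the narrow-build behaviour above, here is a quick check (a minimal sketch; requires Python 3.3+, where PEP 393 removed the narrow/wide build distinction): chr() accepts the full range and astral characters behave as single code points.]

```python
# Python 3.3+ (PEP 393): astral characters are single code points,
# not surrogate pairs, on every build.
s = chr(0xffff + 1)            # U+10000, the first astral code point
print(len(s))                  # one code point, not two code units
print(ord(s) == 0x10000)       # ord() round-trips chr()
print(s.encode('utf-16-le'))   # surrogates appear only when encoding
```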
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano:
> Using variable-sized strings like UTF-8 and UTF-16 for in-memory
> representations is a terrible idea because you can't assume that
> people will only ever want to index the first or last character. On
> average, you need to scan half the string, one character at a time.
> In Big-Oh, we can ignore the factor of 1/2 and just say we scan the
> string, O(N).

In the majority of cases you can remove excessive scanning by caching the most recent index->offset result. If the next index request is nearer the cached index than to the beginning, then iterate from that offset. This converts many operations from quadratic to linear. Locality of reference is common and can often be reasonably exploited.

However, exposing the variable length nature of UTF-8 allows the application to choose efficient techniques for more cases.

Neil
-- http://mail.python.org/mailman/listinfo/python-list
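[Neil's caching idea can be sketched as follows. The `Utf8String` wrapper below is hypothetical, not any real library's API: a forward scan resumes from the last cached (index, byte offset) pair whenever the requested index is at or past it, so a left-to-right traversal costs O(N) in total instead of O(N^2).]

```python
class Utf8String:
    """Index code points in a UTF-8 buffer, caching the last lookup."""

    def __init__(self, data: bytes):
        self._data = data
        self._cache = (0, 0)  # (code point index, byte offset)

    def __getitem__(self, index: int) -> str:
        i, off = self._cache
        if index < i:            # cache is past the target: restart
            i, off = 0, 0
        data = self._data
        while i < index:         # scan forward one code point at a time
            off += 1
            # skip continuation bytes (0b10xxxxxx)
            while off < len(data) and data[off] & 0xC0 == 0x80:
                off += 1
            i += 1
        self._cache = (i, off)
        end = off + 1
        while end < len(data) and data[end] & 0xC0 == 0x80:
            end += 1
        return data[off:end].decode('utf-8')

s = Utf8String('ab€cd'.encode('utf-8'))
print(s[2])   # '€' -- three bytes in UTF-8, one code point
print(s[3])   # 'c' -- resumes from the cached offset, no rescan
```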
Re: How do I display unicode value stored in a string variable using ord()
"Blind Anagram" writes:
> This is an average slowdown by a factor of close to 2.3 on 3.3 when
> compared with 3.2.
>
> I am not posting this to perpetuate this thread but simply to ask
> whether, as you suggest, I should report this as a possible problem
> with the beta?

Being a beta release, is it certain that this release has been compiled with the same optimization level as 3.2?

-- Piet van Oostrum
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano writes:
> Paul Rubin already told you about his experience using OCR to generate
> multiple terabytes of text, and how he would not be happy if that was
> stored in UCS-4.

That particular text was stored on disk as compressed XML that had UTF-8 in the data fields, but I think Roy is right that it would have compressed to around the same size in UCS-4. Converting it to UCS-4 on input would have bloated up the memory footprint, and that was the issue of concern to me.

> Pittance or not, I do not believe that people will widely abandon
> compact storage formats like UTF-8 and Latin-1 for UCS-4 any time
> soon.

Looking at http://www.icu-project.org/ the C++ classes seem to use UTF-16, sort of like Python 3.2 :(. I'm not certain of this though.
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Aug 19, 11:11 pm, wxjmfa...@gmail.com wrote:
> On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
>> But they are not ascii pages, they are (as stated) MOSTLY ascii.
>> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>> a much more memory-expensive encoding than UTF-8.
>
> Well, it seems some software producers know what they are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)

You want the Euro-sign in iso-8859-1?? I object. I want the rupee sign ( ₹ ):
http://en.wikipedia.org/wiki/Indian_rupee_sign
And while we are at it, why not move it (both?) into ASCII?

The problem(s) are:
1. We don't really understand what you are objecting to.
2. UTF-8, like Huffman coding, is a prefix code:
http://en.wikipedia.org/wiki/Prefix_code#Prefix_codes_in_use_today
Like Huffman coding, it compresses based on a statistical argument.
3. Unlike Huffman coding, the statistics are very political: "Is the
Euro more important or Chinese ideograms?" depends on whom you ask.
-- http://mail.python.org/mailman/listinfo/python-list
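[The prefix-code property mentioned in point 2 is easy to check (a minimal sketch): the leading byte of each UTF-8 sequence encodes its length, and continuation bytes all start with the bits 10, so no encoded character is a prefix of another and a decoder can resynchronize at any byte.]

```python
# UTF-8 sequence lengths grow with the code point, like a Huffman code
# biased toward low (historically ASCII) code points.
for ch in ('A', '€', '₹', '\U0001F600'):
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}')

# Continuation bytes are 0b10xxxxxx, so a mid-sequence byte can never
# be mistaken for the start of a character.
assert all(b & 0xC0 == 0x80 for b in '€'.encode('utf-8')[1:])
```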
Re: How do I display unicode value stored in a string variable using ord()
On Mon, 20 Aug 2012 00:44:22 -0400, Roy Smith wrote:
> In article <5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com>,
> Steven D'Aprano wrote:
>>> So it may be with utf-8 someday.
>>
>> Only if you believe that people's ability to generate data will
>> remain lower than people's ability to install more storage.
>
> We're not talking *data*, we're talking *text*. Most of those
> whatever-bytes people are generating are images, video, and music.
> Text is a pittance compared to those.

Paul Rubin already told you about his experience using OCR to generate multiple terabytes of text, and how he would not be happy if that was stored in UCS-4.

HTML is text. XML is text. SVG is text. Source code is text. Email is text. (Well, it's actually bytes, but it looks like ASCII text.) Log files are text, and they can fill a hard drive pretty quickly. Lots of data is text.

Pittance or not, I do not believe that people will widely abandon compact storage formats like UTF-8 and Latin-1 for UCS-4 any time soon. Given that we're still trying to convince people to use UTF-8 over ASCII, I reckon it will be at least 40 years before there's even a slim chance of migrating from UTF-8 to UCS-4 in a widespread manner. In the IT world, that's close enough to "never" -- we might not even be using Unicode in 2052.

In any case, time will tell who is right.

-- Steven
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
In article <5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com>, Steven D'Aprano wrote:
>> So it may be with utf-8 someday.
>
> Only if you believe that people's ability to generate data will remain
> lower than people's ability to install more storage.

We're not talking *data*, we're talking *text*. Most of those whatever-bytes people are generating are images, video, and music. Text is a pittance compared to those.

In any case, text on disk can easily be stored compressed. I would expect the UTF-8 and UTF-32 versions of a text file to compress to just about the same size.
-- http://mail.python.org/mailman/listinfo/python-list
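[Roy's compression point can be checked directly (a minimal sketch using zlib; the exact ratios depend on the text and the compressor): a general-purpose compressor squeezes most of the redundancy out of either encoding, so the raw 4x size gap between UTF-8 and UTF-32 narrows dramatically after compression.]

```python
import zlib

text = ("HTML is text. XML is text. Source code is text. "
        "Log files are text, and they can fill a disk quickly. ") * 50

utf8 = text.encode('utf-8')
utf32 = text.encode('utf-32-le')    # 4 bytes per character, mostly zeros

c8 = zlib.compress(utf8, 9)
c32 = zlib.compress(utf32, 9)
print(f'raw:        utf-8 {len(utf8):6d}  utf-32 {len(utf32):6d}')
print(f'compressed: utf-8 {len(c8):6d}  utf-32 {len(c32):6d}')
```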
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 19:24:30 -0400, Roy Smith wrote:
> In the primordial days of computing, using 8 bits to store a character
> was a profligate waste of memory. What on earth did people need with
> TWO cases of the alphabet

That's obvious, surely? We need two cases so that we can distinguish helping Jack off a horse from helping jack off a horse.

> (not to mention all sorts of weird punctuation)? Eventually, memory
> became cheap enough that the convenience of using one character per
> byte (not to mention 8-bit bytes) outweighed the costs. And crazy
> things like sixbit and rad-50 got swept into the dustbin of history.

8 bit bytes are much older than 8 bit characters. For a long time, ASCII characters used only 7 bits out of the 8.

> So it may be with utf-8 someday.

Only if you believe that people's ability to generate data will remain lower than people's ability to install more storage.

Every few years, a new size of storage medium comes out. The first thing that happens is that people say "40 megabytes? I'll NEVER fill this hard drive up!". The second thing that happens is that they say "Dammit, my 40 MB hard drive is full, and a new one is too expensive, better delete some files." Followed shortly by "400 megabytes? I'll NEVER use that much space!" -- wash, rinse, repeat, through megabytes, gigabytes, terabytes, and it will happen for petabytes next.

So long as our ability to generate data continues to outstrip our ability to store it, compression and memory-efficient storage schemes will remain in demand.

-- Steven
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Mon, Aug 20, 2012 at 10:35 AM, Terry Reedy wrote:
> On 8/19/2012 6:42 PM, Chris Angelico wrote:
>> However, Python goes a bit further by making it VERY clear that this
>> is a mere optimization, and that Unicode strings and bytes strings
>> are completely different beasts. In Pike, it's possible to forget to
>> encode something before (say) writing it to a socket. Everything
>> works fine while you have only ASCII characters in the string, and
>> then breaks when you have a >255 codepoint - or perhaps worse, when
>> you have a codepoint between 128 and 255.
>
> Python writes strings to file objects, including open sockets, without
> creating a bytes object -- IF the file is opened in text mode, which
> always has an associated encoding, even if the default 'ascii'. From
> what you say, this is what Pike is missing.

In text mode, the library does the encoding, but an encoding still happens.

> I am pretty sure that the obvious optimization has already been done.
> The internal bytes of all-ascii text can safely be sent to a file with
> ascii (or ascii-compatible) encoding without intermediate 'decoding'.
> I remember several patches of that sort. If a string is internally
> ucs2 and the file is declared usc2 or utf-16 encoding, then again,
> pairs of bytes can go directly (possibly with a byte swap).

Maybe it doesn't take any memory change, but there is a data type change. A Unicode string cannot be sent over the network; an encoding is needed. In Pike, I can take a string like "\x20AC" (or "\u20ac" or "\U20ac", same thing) and manipulate it as a one-character string, but I cannot write it to a file or file-like object. I can, however, pass it through a codec (and there's string_to_utf8() for the convenience of the common case), and get back something like "\xe2\x82\xac", which is a three-byte string.
Which means that I could have a string containing Latin-1 but not ASCII characters, and Pike will happily write it to a socket without raising a compile-time or run-time error. Python, under the same circumstances, would either raise an error or quietly (and correctly) encode the data. But this is a relatively trivial point, in the scheme of things. Python has an excellent model now for handling Unicode strings, and I would STRONGLY recommend everyone to upgrade to 3.3. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
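[The text-mode behaviour Terry describes can be demonstrated without a socket (a minimal sketch using the io module; the wrapper does the same job Python does for a file opened in text mode):]

```python
import io

raw = io.BytesIO()                 # stands in for a binary file/socket
text = io.TextIOWrapper(raw, encoding='utf-8')

text.write('price: \u20ac5')       # write str; the wrapper encodes
text.flush()
print(raw.getvalue())              # b'price: \xe2\x82\xac5'

# Writing str to the raw binary stream raises instead of guessing:
try:
    raw.write('price: \u20ac5')
except TypeError as e:
    print('TypeError:', e)
```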
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 6:42 PM, Chris Angelico wrote:
> On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy wrote:
>> Python has often copied or borrowed, with adjustments. This time it
>> is the first.

I should have added 'that I know of' ;-)

> Maybe it wasn't consciously borrowed, but whatever innovation is done,
> there's usually an obscure beardless language that did it earlier. :)
>
> Pike has a single string type, which can use the full Unicode range.
> If all codepoints are <256, the string width is 8 (measured in bits);
> if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
> function (similar to the Python function used somewhere earlier in
> this thread, but which I can't at present put my finger to), I find
> that for strings of 16 bytes or more, there's a fixed 20-byte header
> plus the string content, stored in the correct number of bytes. (Pike
> strings, like Python ones, are immutable and do not need expansion
> room.)

It is even possible that someone involved was even vaguely aware that there was an antecedent. The PEP makes no claim that I can see, but lays out the problem and goes right to details of a Python implementation.

> However, Python goes a bit further by making it VERY clear that this
> is a mere optimization, and that Unicode strings and bytes strings are
> completely different beasts. In Pike, it's possible to forget to
> encode something before (say) writing it to a socket. Everything works
> fine while you have only ASCII characters in the string, and then
> breaks when you have a >255 codepoint - or perhaps worse, when you
> have a codepoint between 128 and 255.

Python writes strings to file objects, including open sockets, without creating a bytes object -- IF the file is opened in text mode, which always has an associated encoding, even if the default 'ascii'. From what you say, this is what Pike is missing.

I am pretty sure that the obvious optimization has already been done. The internal bytes of all-ascii text can safely be sent to a file with ascii (or ascii-compatible) encoding without intermediate 'decoding'. I remember several patches of that sort. If a string is internally ucs2 and the file is declared usc2 or utf-16 encoding, then again, pairs of bytes can go directly (possibly with a byte swap).

-- Terry Jan Reedy
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Monday, August 20, 2012 1:03:34 AM UTC+8, Blind Anagram wrote:
> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
>
>> On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
>> [...]
>> If you can consistently replicate a 100% to 1000% slowdown in string
>> handling, please report it as a performance bug:
>>
>> http://bugs.python.org/
>>
>> Don't forget to report your operating system.
>
> For interest, I ran your code snippets on my laptop (Intel core-i7
> 1.8GHz) running Windows 7 x64.
>
> Running Python from a Windows command prompt, I got the following on
> Python 3.2.3 and 3.3 beta 2:
>
> python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
> 1 loops, best of 3: 39.3 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"
> 1 loops, best of 3: 51.8 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
> 1 loops, best of 3: 52 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
> 1 loops, best of 3: 50.3 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
> 1 loops, best of 3: 51.6 usec per loop
> python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
> 1 loops, best of 3: 38.3 usec per loop
> python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
> 1 loops, best of 3: 50.3 usec per loop
>
> python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
> 1 loops, best of 3: 24.5 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"
> 1 loops, best of 3: 24.7 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
> 1 loops, best of 3: 24.8 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
> 1 loops, best of 3: 24 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
> 1 loops, best of 3: 24.1 usec per loop
> python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
> 1 loops, best of 3: 24.4 usec per loop
> python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
> 1 loops, best of 3: 24.3 usec per loop
>
> This is an average slowdown by a factor of close to 2.3 on 3.3 when
> compared with 3.2.
>
> I am not posting this to perpetuate this thread but simply to ask
> whether, as you suggest, I should report this as a possible problem
> with the beta?

Um, another set of functions for speeding up ASCII string operations might be needed. But it is better that Python 3.3 first supports unicode strings that are easy for people in different languages to use. Anyway, I think Cython and Pyrex can be used to tackle this problem.

-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
In article , Chris Angelico wrote:
> Really, the only viable alternative to PEP 393 is a fixed 32-bit
> representation - it's the only way that's guaranteed to provide
> equivalent semantics. The new storage format is guaranteed to take no
> more memory than that, and provide equivalent functionality.

In the primordial days of computing, using 8 bits to store a character was a profligate waste of memory. What on earth did people need with TWO cases of the alphabet (not to mention all sorts of weird punctuation)? Eventually, memory became cheap enough that the convenience of using one character per byte (not to mention 8-bit bytes) outweighed the costs. And crazy things like sixbit and rad-50 got swept into the dustbin of history. So it may be with utf-8 someday.

Clearly, the world has moved to a 32-bit character set. Not all parts of the world know that yet, or are willing to admit it, but that doesn't negate the fact that it's true. Equally clearly, the concept of one character per byte is a big win. The obvious conclusion is that eventually, when memory gets cheap enough, we'll all be doing utf-32 and all this transcoding nonsense will look as antiquated as rad-50 does today.
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy wrote:
> On 8/19/2012 4:04 AM, Paul Rubin wrote:
>> I realize the folks who designed and implemented PEP 393 are very
>> smart cookies and considered stuff carefully, while I'm just an
>> internet user posting an immediate impression of something I hadn't
>> seen before (I still use Python 2.6), but I still have to ask: if the
>> 393 approach makes sense, why don't other languages do it?
>
> Python has often copied or borrowed, with adjustments. This time it is
> the first. We will see how it goes, but it has been tested for nearly
> a year already.

Maybe it wasn't consciously borrowed, but whatever innovation is done, there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range. If all codepoints are <256, the string width is 8 (measured in bits); if <65536, width is 16; otherwise 32. Using the inbuilt count_memory function (similar to the Python function used somewhere earlier in this thread, but which I can't at present put my finger to), I find that for strings of 16 bytes or more, there's a fixed 20-byte header plus the string content, stored in the correct number of bytes. (Pike strings, like Python ones, are immutable and do not need expansion room.)

However, Python goes a bit further by making it VERY clear that this is a mere optimization, and that Unicode strings and bytes strings are completely different beasts. In Pike, it's possible to forget to encode something before (say) writing it to a socket. Everything works fine while you have only ASCII characters in the string, and then breaks when you have a >255 codepoint - or perhaps worse, when you have a codepoint between 128 and 255.

-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 2:11 PM, wxjmfa...@gmail.com wrote:
> Well, it seems some software producers know what they are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)

Yes, Python lets you choose your byte encoding from those and a hundred others. I believe all the codecs are now tested in both directions. It was not an easy task.

As to the examples: Latin-1 dates to 1985 and before, and the 1988 version was published as a standard in 1992.
https://en.wikipedia.org/wiki/Latin-1

"The name euro was officially adopted on 16 December 1995."
https://en.wikipedia.org/wiki/Euro

No wonder Latin-1 does not contain the Euro sign. International standards organizations' standards are relatively fixed. (The unicode consortium will not even correct misspelled character names.) Instead, new standards with a new number are adopted. For better or worse, private mappings are more flexible. In its Mac mapping, Apple "replaced the generic currency sign ¤ with the euro sign €". (See Latin-1 reference.) Great if you use Euros, not so great if you were using the previous sign for something else. Microsoft changed an unneeded code to the Euro for Windows cp-1252.
https://en.wikipedia.org/wiki/Windows-1252

"It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by "smart quotes" in Microsoft software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling. 
This is now standard behavior in the draft HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.[1]" Lots of fun. Too bad Microsoft won't push utf-8 so we can all communicate text with much less chance of ambiguity. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
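[The mislabeling problem quoted above is easy to reproduce (a minimal sketch): bytes produced by Windows "smart quotes" under cp1252 land in the C1 control range when decoded as ISO-8859-1, which is why non-Windows software used to show boxes or question marks.]

```python
windows_bytes = '\u201csmart\u201d \u20ac'.encode('cp1252')
print(windows_bytes)                 # b'\x93smart\x94 \x80'

# Mislabeled as ISO-8859-1, the same bytes decode to C1 control chars:
mislabeled = windows_bytes.decode('iso-8859-1')
print(hex(ord(mislabeled[0])))       # 0x93, not U+201C

# Decoded with the right label, the text round-trips:
assert windows_bytes.decode('cp1252') == '\u201csmart\u201d \u20ac'
```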
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 1:03 PM, Blind Anagram wrote:
> Running Python from a Windows command prompt, I got the following on
> Python 3.2.3 and 3.3 beta 2:
>
> python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
> 1 loops, best of 3: 39.3 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"
> 1 loops, best of 3: 51.8 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
> 1 loops, best of 3: 52 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
> 1 loops, best of 3: 50.3 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
> 1 loops, best of 3: 51.6 usec per loop
> python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
> 1 loops, best of 3: 38.3 usec per loop
> python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
> 1 loops, best of 3: 50.3 usec per loop
>
> python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
> 1 loops, best of 3: 24.5 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"
> 1 loops, best of 3: 24.7 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
> 1 loops, best of 3: 24.8 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
> 1 loops, best of 3: 24 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
> 1 loops, best of 3: 24.1 usec per loop
> python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
> 1 loops, best of 3: 24.4 usec per loop
> python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
> 1 loops, best of 3: 24.3 usec per loop

This is one test repeated 7 times with essentially irrelevant variations. The difference is less on my system (50%). Others report seeing 3.3 as faster. When I asked on pydev, the answer was: don't bother making a tracker issue unless I was personally interested in investigating why search is relatively slow in 3.3 on Windows. Any change would have to not slow other operations or severely impact search on other systems. 
I suggest the same answer to you. If you seriously want to compare old and new unicode, go to
http://hg.python.org/cpython/file/tip/Tools/stringbench/stringbench.py
and click raw to download. Run on 3.2 and 3.3, ignoring the bytes times.

Here is a version of the first comparison from stringbench:

print(timeit('''('NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE AID OF PYTHON' * 10).lower()'''))

Results are 5.6 for 3.2 and 0.8 for 3.3. WOW! 3.3 is 7 times faster!

OK, not fair. I cherry picked. The 7x speedup in 3.3 is likely at least partly independent of the 393 unicode change. The same test in stringbench for bytes is twice as fast in 3.3 as in 3.2, but only 2x, not 7x. In fact, it may have been the bytes/unicode comparison in 3.2 that suggested that unicode case conversion of ascii chars might be made faster.

The sum of the 3.3 unicode times is 109 versus 110 for 3.3 bytes and 125 for 3.2 unicode. This unweighted sum is not really fair since the raw times vary by a factor of at least 100. But it does suggest that anyone claiming that 3.3 unicode is overall 'slower' than 3.2 unicode has some work to do.

There is also this. On my machine, the lowest bytes-time/unicode-time ratio for 3.3 is .71. This suggests that there is not a lot of fluff left in the unicode code, and that not much is lost by the bytes to unicode switch for strings.

-- Terry Jan Reedy
-- http://mail.python.org/mailman/listinfo/python-list
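[For anyone wanting to reproduce numbers like these from a script rather than the command line, the same comparison can be run with the timeit module (a minimal sketch; absolute times vary by machine and build):]

```python
from timeit import timeit

stmt = ("('NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE AID "
        "OF PYTHON' * 10).lower()")

# number=100000 keeps the run short; timeit's default is 1,000,000.
seconds = timeit(stmt, number=100000)
print(f'{seconds / 100000 * 1e6:.2f} usec per loop')

# The bytes variant, for the 3.2-vs-3.3 comparison mentioned above:
bytes_stmt = ("(b'NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE "
              "AID OF PYTHON' * 10).lower()")
print(f'{timeit(bytes_stmt, number=100000) / 100000 * 1e6:.2f} usec per loop')
```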
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 18:03:34 +0100, Blind Anagram wrote:
> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
>
>> If you can consistently replicate a 100% to 1000% slowdown in string
>> handling, please report it as a performance bug:
>>
>> http://bugs.python.org/
>>
>> Don't forget to report your operating system.
[...]
> This is an average slowdown by a factor of close to 2.3 on 3.3 when
> compared with 3.2.
>
> I am not posting this to perpetuate this thread but simply to ask
> whether, as you suggest, I should report this as a possible problem
> with the beta?

Possibly, if it is consistent and non-trivial. Serious performance regressions are bugs. Trivial ones, not so much.

Terry Reedy has already asked the Python devs about this issue, and they have made it clear that they aren't hugely interested in micro-benchmarks in isolation. If you want the bug report to be taken seriously, you would need to run the full Python string benchmark. The results of that would be interesting to see.

-- Steven
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote:
> On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano wrote:
[...]
>> The PEP explicitly states that it only uses a 1-byte format for ASCII
>> strings, not Latin-1:
>
> I think you misunderstand the PEP then, because that is empirically
> false.

Yes I did misunderstand. Thank you for the clarification.

-- Steven
-- http://mail.python.org/mailman/listinfo/python-list
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote:
> Terry Reedy writes:
>> I would call it O(k), where k is a selectable constant. Slowing
>> access by a factor of 100 is hardly acceptable to me.
>
> If k is constant then O(k) is the same as O(1). That is how O notation
> works.

You might as well say that if N is constant, O(N**2) is constant too, and just like magic you have now made Bubble Sort a constant-time sort function! That's not how it works.

Of course *if* k is constant, O(k) is constant too, but k is not constant. In context we are talking about string indexing and slicing. There is no value of k, say, k = 2, for which you can say "People will sometimes ask for string[2] but never ask for string[3]". That is absurd. Since k can vary from 0 to N-1, we can say that the average string index lookup is k = (N-1)//2, which clearly depends on N.

-- Steven
-- http://mail.python.org/mailman/listinfo/python-list
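[Steven's average can be checked by summing the scan cost over every possible index (a minimal sketch): a lookup of index k scans k characters, so the mean over the N equally likely indices is (N-1)/2, which grows with N rather than staying constant.]

```python
# Mean scan distance over all indices 0..N-1 of an N-character string.
N = 1000
mean_scan = sum(range(N)) / N      # (0 + 1 + ... + N-1) / N
print(mean_scan)                   # 499.5, which equals (N - 1) / 2
assert mean_scan == (N - 1) / 2    # O(N) on average, not O(1)
```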
Re: How do I display unicode value stored in a string variable using ord()
Ian Kelly writes:
>>>> print (type(bytes(range(256)).decode('latin1')))
> <class 'str'>

Thanks.
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/2012 19:11, wxjmfa...@gmail.com wrote:
> On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
>> But they are not ascii pages, they are (as stated) MOSTLY ascii.
>> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>> a much more memory-expensive encoding than UTF-8.
>
> Imagine a US banking application, everything in ascii, except ... the
> € currency symbol, code point 0x20ac.
>
> Well, it seems some software producers know what they are doing.
>
> >>> '€'.encode('cp1252')
> b'\x80'
> >>> '€'.encode('mac-roman')
> b'\xdb'
> >>> '€'.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
> in position 0: ordinal not in range(256)
>
> jmf

Well that's it then, the world stock markets will all collapse tonight when the news leaks out that those stupid Americans haven't yet realised that much of Europe (with at least one very noticeable and sensible exception :) uses Euros. I'd better sell all my stock holdings fast.

-- Cheers. Mark Lawrence.
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly wrote: > Note that this only describes the structure of "compact" string > objects, which I have to admit I do not fully understand from the PEP. > The wording suggests that it only uses the PyASCIIObject structure, > not the derived structures. It then says that for compact ASCII > strings "the UTF-8 data, the UTF-8 length and the wstr length are the > same as the length of the ASCII data." But these fields are part of > the PyCompactUnicodeObject structure, not the base PyASCIIObject > structure, so they would not exist if only PyASCIIObject were used. > It would also imply that compact non-ASCII strings are stored > internally as UTF-8, which would be surprising. Oh, now I get it. I had missed the part where it says "character data immediately follow the base structure". And the bit about the "UTF-8 data, the UTF-8 length and the wstr length" are not describing the contents of those fields, but rather where the data can be alternatively found since the fields don't exist. -- http://mail.python.org/mailman/listinfo/python-list
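[The compact layouts discussed above can be observed from Python code (a minimal sketch for CPython 3.3+; exact header sizes vary by version and platform): sys.getsizeof shows the per-character storage growing from 1 to 2 to 4 bytes as the widest code point crosses each threshold, plus a slightly smaller header for pure-ASCII strings than for other latin-1 strings.]

```python
import sys

n = 1000
samples = {
    'ascii  (<= U+007F)': 'a' * n,
    'latin1 (<= U+00FF)': '\u00e9' * n,
    'ucs2   (<= U+FFFF)': '\u20ac' * n,
    'ucs4   (>  U+FFFF)': '\U0001F600' * n,
}
sizes = {label: sys.getsizeof(s) for label, s in samples.items()}
for label, size in sizes.items():
    print(f'{label}: {size} bytes for {n} characters')

# Each step up in the widest code point costs strictly more memory.
values = list(sizes.values())
assert values == sorted(values)
```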
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin wrote:
> Ian Kelly writes:
>>>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
>> 329
>
> Please try:
>
>    print (type(bytes(range(256)).decode('latin1')))
>
> to make sure that what comes back is actually a unicode string rather
> than a byte string.

As I understand it, the decode method never returns a byte string in Python 3, but if you insist:

>>> print (type(bytes(range(256)).decode('latin1')))
<class 'str'>

-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Ian Kelly writes:
>>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
> 329

Please try:

   print (type(bytes(range(256)).decode('latin1')))

to make sure that what comes back is actually a unicode string rather than a byte string.
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
> But they are not ascii pages, they are (as stated) MOSTLY ascii.
> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
> a much more memory-expensive encoding than UTF-8.

Imagine a US banking application, everything in ascii, except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they are doing.

>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf
-- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 08/19/2012 01:03 PM, Blind Anagram wrote: > "Steven D'Aprano" wrote in message > news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... > > On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: > > [...] > If you can consistently replicate a 100% to 1000% slowdown in string > handling, please report it as a performance bug: > > http://bugs.python.org/ > > Don't forget to report your operating system. > > > For interest, I ran your code snippets on my laptop (Intel core-i7 > 1.8GHz) running Windows 7 x64. > > Running Python from a Windows command prompt, I got the following on > Python 3.2.3 and 3.3 beta 2: > > python33\python" -m timeit "('abc' * 1000).replace('c', 'de')" > 1 loops, best of 3: 39.3 usec per loop > python33\python" -m timeit "('ab…' * 1000).replace('…', '……')" > 1 loops, best of 3: 51.8 usec per loop > python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" > 1 loops, best of 3: 52 usec per loop > python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" > 1 loops, best of 3: 50.3 usec per loop > python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')" > 1 loops, best of 3: 51.6 usec per loop > python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" > 1 loops, best of 3: 38.3 usec per loop > python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" > 1 loops, best of 3: 50.3 usec per loop > > python32\python" -m timeit "('abc' * 1000).replace('c', 'de')" > 1 loops, best of 3: 24.5 usec per loop > python32\python" -m timeit "('ab…' * 1000).replace('…', '……')" > 1 loops, best of 3: 24.7 usec per loop > python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" > 1 loops, best of 3: 24.8 usec per loop > python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" > 1 loops, best of 3: 24 usec per loop > python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')" > 1 loops, best of 3: 24.1 usec per loop > python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" > 1 loops, best of 3: 24.4 usec per loop 
> python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" > 1 loops, best of 3: 24.3 usec per loop > > This is an average slowdown by a factor of close to 2.3 on 3.3 when > compared with 3.2. > Using your measurement numbers, I get an average of 1.95, not 2.3 -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
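For what it's worth, DaveA's 1.95 is easy to reproduce from the quoted timings by averaging the per-test 3.3/3.2 ratios:

```python
# Timings quoted above, in usec per loop, same test order for both builds.
t33 = [39.3, 51.8, 52.0, 50.3, 51.6, 38.3, 50.3]  # Python 3.3 beta 2
t32 = [24.5, 24.7, 24.8, 24.0, 24.1, 24.4, 24.3]  # Python 3.2.3

average = sum(a / b for a, b in zip(t33, t32)) / len(t33)
print(round(average, 2))  # 1.95
```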
Re: How do I display unicode value stored in a string variable using ord()
wrote in message news:5dfd1779-9442-4858-9161-8f1a06d56...@googlegroups.com... Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit : "Steven D'Aprano" wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2: python33\python" -m timeit "('abc' * 1000).replace('c', 'de')" 1 loops, best of 3: 39.3 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', '……')" 1 loops, best of 3: 51.8 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" 1 loops, best of 3: 52 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" 1 loops, best of 3: 50.3 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')" 1 loops, best of 3: 51.6 usec per loop python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" 1 loops, best of 3: 38.3 usec per loop python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" 1 loops, best of 3: 50.3 usec per loop python32\python" -m timeit "('abc' * 1000).replace('c', 'de')" 1 loops, best of 3: 24.5 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', '……')" 1 loops, best of 3: 24.7 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" 1 loops, best of 3: 24.8 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" 1 loops, best of 3: 24 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')" 1 loops, best of 3: 24.1 usec per loop python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" 1 loops, best of 3: 24.4 usec per 
loop python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" 1 loops, best of 3: 24.3 usec per loop This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta? I use win7 pro 32bits in intel? Thanks for reporting these numbers. To be clear: I'm not complaining, but the fact that there is a slow down is a clear indication (in my mind), there is a point somewhere. I may be reading your input wrongly, but it seems to me that you are not only reporting a slowdown but you are also suggesting that this slowdown is the result of bad design decisions by the Python development team. I don't want to get involved in the latter part of your argument because I am convinced that the Python team are doing their very best to find a good compromise between the various design constraints that they face in meeting these needs. Nevertheless, the post that I responded to contained the suggestion that slowdowns above 100% (which I took as a factor of 2) would be worth reporting as a possible bug. So I thought that it was worth asking about this as I may have misunderstood the level of slowdown that is worth reporting. There is also a potential problem in timings on laptops with turbo-boost (as I have), although the times look fairly consistent. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano wrote: > On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393: >> There is some additional benefit for Latin-1 users, but this has nothing >> to do with Python. If Python is going to have the option of a 1-byte >> representation (and as long as we have the flexible representation, I >> can see no reason not to), > > The PEP explicitly states that it only uses a 1-byte format for ASCII > strings, not Latin-1: I think you misunderstand the PEP then, because that is empirically false. Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC v.1600 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.getsizeof(bytes(range(256)).decode('latin1')) 329 The constructed string contains all 256 Latin-1 characters, so if Latin-1 strings must be stored in the 2-byte format, then the size should be at least 512 bytes. It is not, so I think it must be using the 1-byte encoding. > "ASCII-only Unicode strings will again use only one byte per character" This says nothing one way or the other about non-ASCII Latin-1 strings. > "If the maximum character is less than 128, they use the PyASCIIObject > structure" Note that this only describes the structure of "compact" string objects, which I have to admit I do not fully understand from the PEP. The wording suggests that it only uses the PyASCIIObject structure, not the derived structures. It then says that for compact ASCII strings "the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data." But these fields are part of the PyCompactUnicodeObject structure, not the base PyASCIIObject structure, so they would not exist if only PyASCIIObject were used. It would also imply that compact non-ASCII strings are stored internally as UTF-8, which would be surprising. 
> and: > > "The data and utf8 pointers point to the same memory if the string uses > only ASCII characters (using only Latin-1 is not sufficient)." This says that if the data are ASCII, then the 1-byte representation and the utf8 pointer will share the same memory. It does not imply that the 1-byte representation is not used for Latin-1, only that it cannot also share memory with the utf8 pointer. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Terry Reedy writes: >> Meanwhile, an example of the 393 approach failing: > I am completely baffled by this, as this example is one where the 393 > approach potentially wins. What? The 393 approach is supposed to avoid memory bloat and that does the opposite. >> I was involved in a project that dealt with terabytes of OCR data of >> mostly English text. So the chars were mostly ascii, > 3.3 stores ascii pages 1 byte/char rather than 2 or 4. But they are not ascii pages, they are (as stated) MOSTLY ascii. E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses a much more memory-expensive encoding than UTF-8. > I doubt that there are really any non-bmp chars. You may be right about this. I thought about it some more after posting and I'm not certain that there were supplemental characters. > As Steven said, reject such false identifications. Reject them how? >> That's a natural for UTF-8 > 3.3 would convert to utf-8 for storage on disk. They are already in utf-8 on disk though that doesn't matter since they are also compressed. >> but the PEP-393 approach would bloat up the memory >> requirements by a factor of 4. > 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally > better than always? The bloat is in comparison with utf-8, in that example. > That looks like a 3.2- narrow build. Such builds treat unicode strings > as sequences of code units rather than sequences of codepoints. Not an > implementation bug, but compromise design that goes back about a > decade to when unicode was added to Python. I thought the whole point of Python 3's disruptive incompatibility with Python 2 was to clean up past mistakes and compromises, of which unicode headaches was near the top of the list. So I'm surprised they seem to have repeated a mistake there. > I would call it O(k), where k is a selectable constant. Slowing access > by a factor of 100 is hardly acceptable to me. If k is constant then O(k) is the same as O(1). That is how O notation works. 
I wouldn't believe the 100x figure without seeing it measured in real-world applications. -- http://mail.python.org/mailman/listinfo/python-list
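Paul's bloat scenario is straightforward to demonstrate on a PEP 393 build: a single supplementary-plane character forces the entire string into the 4-byte form, while UTF-8 pays extra bytes only for the characters that need them. (The snake emoji here is just a stand-in for a stray OCR misrecognition.)

```python
import sys

page = 'a' * 9999 + '\U0001F40D'   # mostly ASCII, one astral char

print(sys.getsizeof(page))         # ~40 KB: PEP 393 stores 4 bytes/char
print(len(page.encode('utf-8')))   # 10003 bytes: UTF-8 stores 9999*1 + 4
```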
Re: How do I display unicode value stored in a string variable using ord()
Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit : > "Steven D'Aprano" wrote in message > > news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... > > > > On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: > > > > [...] > > If you can consistently replicate a 100% to 1000% slowdown in string > > handling, please report it as a performance bug: > > > > http://bugs.python.org/ > > > > Don't forget to report your operating system. > > > > > > For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) > > running Windows 7 x64. > > > > Running Python from a Windows command prompt, I got the following on Python > > 3.2.3 and 3.3 beta 2: > > > > python33\python" -m timeit "('abc' * 1000).replace('c', 'de')" > > 1 loops, best of 3: 39.3 usec per loop > > python33\python" -m timeit "('ab…' * 1000).replace('…', '……')" > > 1 loops, best of 3: 51.8 usec per loop > > python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" > > 1 loops, best of 3: 52 usec per loop > > python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" > > 1 loops, best of 3: 50.3 usec per loop > > python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')" > > 1 loops, best of 3: 51.6 usec per loop > > python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" > > 1 loops, best of 3: 38.3 usec per loop > > python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" > > 1 loops, best of 3: 50.3 usec per loop > > > > python32\python" -m timeit "('abc' * 1000).replace('c', 'de')" > > 1 loops, best of 3: 24.5 usec per loop > > python32\python" -m timeit "('ab…' * 1000).replace('…', '……')" > > 1 loops, best of 3: 24.7 usec per loop > > python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" > > 1 loops, best of 3: 24.8 usec per loop > > python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" > > 1 loops, best of 3: 24 usec per loop > > python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')" > > 1 loops, best of 3: 24.1 usec per 
loop > > python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" > > 1 loops, best of 3: 24.4 usec per loop > > python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" > > 1 loops, best of 3: 24.3 usec per loop > > > > This is an average slowdown by a factor of close to 2.3 on 3.3 when compared > > with 3.2. > > > > I am not posting this to perpetuate this thread but simply to ask whether, > > as you suggest, I should report this as a possible problem with the beta? I use win7 pro 32bits in intel? Thanks for reporting these numbers. To be clear: I'm not complaining, but the fact that there is a slow down is a clear indication (in my mind), there is a point somewhere. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 4:04 AM, Paul Rubin wrote: Meanwhile, an example of the 393 approach failing: I am completely baffled by this, as this example is one where the 393 approach potentially wins. I was involved in a project that dealt with terabytes of OCR data of mostly English text. So the chars were mostly ascii, 3.3 stores ascii pages 1 byte/char rather than 2 or 4. > but there would be occasional non-ascii chars including supplementary plane characters, either because of special symbols that were really in the text, or the typical OCR confusion emitting those symbols due to printing imprecision. I doubt that there are really any non-bmp chars. As Steven said, reject such false identifications. > That's a natural for UTF-8 3.3 would convert to utf-8 for storage on disk. but the PEP-393 approach would bloat up the memory requirements by a factor of 4. 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally better than always? py> s = chr(0xFFFF + 1) py> a, b = s That looks like Python 3.2 is buggy and that sample should just throw an error. s is a one-character string and should not be unpackable. That looks like a 3.2- narrow build. Such builds treat unicode strings as sequences of code units rather than sequences of codepoints. Not an implementation bug, but compromise design that goes back about a decade to when unicode was added to Python. At that time, there were only a few defined non-BMP chars and their usage was extremely rare. There are now more extended chars than BMP chars and usage will become more common even in English text. Pre 3.3, there are really 2 sub-versions of every Python version: a narrow build and a wide build version, with not very well documented different behaviors for any string with extended chars. That is and would have become an increasing problem as extended chars are increasingly used. If you want to say that what was once a practical compromise has become a design bug, I would not argue. 
In any case, 3.3 fixes that split and returns Python to being one cross-platform language. I realize the folks who designed and implemented PEP 393 are very smart cookies and considered stuff carefully, while I'm just an internet user posting an immediate impression of something I hadn't seen before (I still use Python 2.6), but I still have to ask: if the 393 approach makes sense, why don't other languages do it? Python has often copied or borrowed, with adjustments. This time it is the first. We will see how it goes, but it has been tested for nearly a year already. Ropes of UTF-8 segments seems like the most obvious approach and I wonder if it was considered. By that I mean pick some implementation constant k (say k=128) and represent the string as a UTF-8 encoded byte array, accompanied by a vector n//k pointers into the byte array, where n is the number of codepoints in the string. Then you can reach any offset analogously to reading a random byte on a disk, by seeking to the appropriate block, and then reading the block and getting the char you want within it. Random access is then O(1) though the constant is higher than it would be with fixed width encoding. I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. For strings less than k, access is O(len). I believe slicing would require re-indexing. As 393 was near adoption, I proposed a scheme using utf-16 (narrow builds) with a supplementary index of extended chars when there are any. That makes access O(1) if there are none and O(log(k)), where k is the number of extended chars in the string, if there are some. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
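Paul's block-indexed UTF-8 idea can be sketched in a few lines: store the UTF-8 bytes plus the byte offset of every k-th codepoint, so indexing costs at most k-1 forward steps from the nearest checkpoint. The class and parameter names below are mine, illustrative only, not from any real implementation:

```python
class IndexedUTF8:
    """Sketch: UTF-8 bytes plus one byte-offset checkpoint per k codepoints."""

    def __init__(self, text, k=128):
        self.k = k
        self.data = text.encode('utf-8')
        self.length = len(text)
        self.offsets = []              # byte offsets of codepoints 0, k, 2k, ...
        pos = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.offsets.append(pos)
            pos += len(ch.encode('utf-8'))

    @staticmethod
    def _width(b):
        # Width of a UTF-8 sequence, determined from its lead byte.
        return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        pos = self.offsets[i // self.k]
        for _ in range(i % self.k):    # at most k-1 steps, hence Paul's O(k)
            pos += self._width(self.data[pos])
        end = pos + self._width(self.data[pos])
        return self.data[pos:end].decode('utf-8')

s = IndexedUTF8('ab€' * 1000)
s[2]   # '€': one checkpoint lookup plus two forward steps
```

Slicing and mutation are where this gets painful (every slice needs its own checkpoint table), which may be part of why ropes of UTF-8 lost out to flat fixed-width arrays.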
Re: How do I display unicode value stored in a string variable using ord()
"Steven D'Aprano" wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2: python33\python" -m timeit "('abc' * 1000).replace('c', 'de')" 1 loops, best of 3: 39.3 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', '……')" 1 loops, best of 3: 51.8 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" 1 loops, best of 3: 52 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" 1 loops, best of 3: 50.3 usec per loop python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')" 1 loops, best of 3: 51.6 usec per loop python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" 1 loops, best of 3: 38.3 usec per loop python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" 1 loops, best of 3: 50.3 usec per loop python32\python" -m timeit "('abc' * 1000).replace('c', 'de')" 1 loops, best of 3: 24.5 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', '……')" 1 loops, best of 3: 24.7 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')" 1 loops, best of 3: 24.8 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')" 1 loops, best of 3: 24 usec per loop python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')" 1 loops, best of 3: 24.1 usec per loop python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')" 1 loops, best of 3: 24.4 usec per loop python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')" 1 loops, best of 3: 24.3 usec per loop This is an average slowdown by a 
factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta? -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 4:54 AM, wxjmfa...@gmail.com wrote: About the exemples contested by Steven: eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')") And it is good enough to show the problem. Period. Repeating a false claim over and over does not make it true. Two people on pydev claim that 3.3 is *faster* on their systems (one unspecified, one OSX10.8). -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 15:25, Steven D'Aprano wrote: Not necessarily. Presumably you're scanning each page into a single string. Then only the pages containing a supplementary plane char will be bloated, which is likely to be rare. Especially since I don't expect your OCR application would recognise many non-BMP characters -- what does U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software doesn't recognise it, you can't get it in your output. (If you do, the OCR software has a nasty bug.) Anyway, in my ignorant opinion the proper fix here is to tell the OCR software not to bother trying to recognise Imperial Aramaic, Domino Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't expecting them in your source material. Not only will the scanning go faster, but you'll get fewer wrong characters. Consider the automated recognition of a CAPTCHA. As the chars have to be entered by the user on a keyboard, only the most basic charset can be used, so the problem of which chars are possible is quite limited. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote: > Steven D'Aprano writes: >> This standard data structure is called UCS-2 ... There's an extension >> to UCS-2 called UTF-16 > > My own understanding is UCS-2 simply shouldn't be used any more. Pretty much. But UTF-16 with lax support for surrogates (that is, surrogates are included but treated as two characters) is essentially UCS-2 with the restriction against surrogates lifted. That's what Python currently does, and Javascript. http://mathiasbynens.be/notes/javascript-encoding The reality is that support for the Unicode supplementary planes is pretty poor. Even when applications support it, most fonts don't have glyphs for the characters. Anything which makes handling of Unicode supplementary characters better is a step forward. >> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string >> operations are not O(1) but are O(N). That means they are slow, or >> buggy, pick one. > > This I don't see. What are the basic string operations? The ones I'm specifically referring to are indexing and copying substrings. There may be others. > * Examine the first character, or first few characters ("few" = "usually > bounded by a small constant") such as to parse a token from an input > stream. This is O(1) with either encoding. That's actually O(K), for K = "a few", whatever "a few" means. But we know that anything is fast for small enough N (or K in this case). > * Slice off the first N characters. This is O(N) with either encoding > if it involves copying the chars. I guess you could share references > into the same string, but if the slice reference persists while the > big reference is released, you end up not freeing the memory until > later than you really should. As a first approximation, memory copying is assumed to be free, or at least constant time. That's not strictly true, but Big Oh analysis is looking at algorithmic complexity. It's not a substitute for actual benchmarks. 
> Meanwhile, an example of the 393 approach failing: I was involved in a > project that dealt with terabytes of OCR data of mostly English text. I assume that this wasn't one giant multi-terabyte string. > So > the chars were mostly ascii, but there would be occasional non-ascii > chars including supplementary plane characters, either because of > special symbols that were really in the text, or the typical OCR > confusion emitting those symbols due to printing imprecision. That's a > natural for UTF-8 but the PEP-393 approach would bloat up the memory > requirements by a factor of 4. Not necessarily. Presumably you're scanning each page into a single string. Then only the pages containing a supplementary plane char will be bloated, which is likely to be rare. Especially since I don't expect your OCR application would recognise many non-BMP characters -- what does U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software doesn't recognise it, you can't get it in your output. (If you do, the OCR software has a nasty bug.) Anyway, in my ignorant opinion the proper fix here is to tell the OCR software not to bother trying to recognise Imperial Aramaic, Domino Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't expecting them in your source material. Not only will the scanning go faster, but you'll get fewer wrong characters. [...] > I realize the folks who designed and implemented PEP 393 are very smart > cookies and considered stuff carefully, while I'm just an internet user > posting an immediate impression of something I hadn't seen before (I > still use Python 2.6), but I still have to ask: if the 393 approach > makes sense, why don't other languages do it? There has to be a first time for everything. > Ropes of UTF-8 segments seems like the most obvious approach and I > wonder if it was considered. 
Ropes have been considered and rejected because while they are asymptotically fast, in common cases the added complexity actually makes them slower. Especially for immutable strings where you aren't inserting into the middle of a string. http://mail.python.org/pipermail/python-dev/2000-February/002321.html PyPy has revisited ropes and uses, or at least used, ropes as their native string data structure. But that's ropes of *bytes*, not UTF-8. http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote: > Steven D'Aprano writes: >> result = text[end:] > > if end not near the end of the original string, then this is O(N) even > with fixed-width representation, because of the char copying. Technically, yes. But it's a straight copy of a chunk of memory, which means it's fast: your OS and hardware try to make straight memory copies as fast as possible. Big-Oh analysis frequently glosses over implementation details like that. Of course, that assumption gets shaky when you start talking about extra large blocks, and it falls apart completely when your OS starts paging memory to disk. But if it helps to avoid irrelevant technical details, change it to text[end:end+10] or something. > if it is near the end, by knowing where the string data area ends, I > think it should be possible to scan backwards from the end, recognizing > what bytes can be the beginning of code points and counting off the > appropriate number. This is O(1) if "near the end" means "within a > constant". You know, I think you are misusing Big-Oh analysis here. It really wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort lists with a single item". Well, yes, that is absolutely true, but that's a special case that doesn't give you any insight into why using Bubble Sort as your general purpose sort routine is a terrible idea. Using variable-sized strings like UTF-8 and UTF-16 for in-memory representations is a terrible idea because you can't assume that people will only ever want to index the first or last character. On average, you need to scan half the string, one character at a time. In Big-Oh, we can ignore the factor of 1/2 and just say we scan the string, O(N). That's why languages tend to use fixed character arrays for strings. Haskell is an exception, using linked lists which require traversing the string to jump to an index. 
The manual even warns: [quote] If you think of a Text value as an array of Char values (which it is not), you run the risk of writing inefficient code. An idiom that is common in some languages is to find the numeric offset of a character or substring, then use that number to split or trim the searched string. With a Text value, this approach would require two O(n) operations: one to perform the search, and one to operate from wherever the search ended. [end quote] http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html -- Steven -- http://mail.python.org/mailman/listinfo/python-list
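The backward scan Paul proposes (quoted earlier in this post) works because UTF-8 is self-synchronizing: continuation bytes always match the bit pattern 10xxxxxx, so codepoints can be counted from the end without decoding. A sketch; the function name is mine:

```python
def nth_from_end(data, n):
    """Byte offset of the nth codepoint from the end of UTF-8 bytes (n >= 1)."""
    pos = len(data)
    for _ in range(n):
        pos -= 1
        while data[pos] & 0xC0 == 0x80:   # skip continuation bytes (10xxxxxx)
            pos -= 1
    return pos

buf = 'ab€œé'.encode('utf-8')
buf[nth_from_end(buf, 1):].decode('utf-8')   # 'é'
buf[nth_from_end(buf, 3):].decode('utf-8')   # '€œé'
```

As Steven notes, this is only O(1) when n is bounded by a constant; counting back to the middle of the string is still a linear scan.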
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 11:19, Chris Angelico wrote: On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat wrote: The date stamp is different but the Python version is the same Check out what 'sys.maxunicode' is in each of those Pythons. It's possible that one is a wide build and the other narrow. Ah ... I built my local version from source and no, I didn't read the makefile so I didn't configure for a wide build :-( not that I would have known the difference at that time. [lipska@ubuntu ~]$ python3.2 Python 3.2.3 (default, Jul 17 2012, 14:23:10) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.maxunicode 65535 >>> Later, I did an apt-get install idle3 which pulled down a precompiled IDLE from the Ubuntu repos This was obviously compiled 'wide' Python 3.2.3 (default, May 3 2012, 15:51:42) [GCC 4.6.3] on linux2 Type "copyright", "credits" or "license()" for more information. No Subprocess >>> import sys >>> sys.maxunicode 1114111 >>> All very interesting and enlightening Thanks lipska -- Lipska the Kat©: Troll hunter, sandbox destroyer and farscape dreamer of Aeryn Sun -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/2012 09:54, wxjmfa...@gmail.com wrote: About the exemples contested by Steven: eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')") And it is good enough to show the problem. Period. The rest (you have to do this, you should not do this, why are you using these characters - amazing and stupid question -) does not count. The real problem is elsewhere. *Americans* do not wish a character occupies 4 bytes in *their* memory. The rest of the world does not count. The same thing happens with the utf-8 coding scheme. Technically, it is fine. But after n years of usage, one should recognize it just became an ascii2. Especially for those who undestand nothing in that field and are not even aware, characters are "coded". I'm the first to think, this is legitimate. Memory or "ability to treat all text in the same and equal way"? End note. This kind of discussion is not specific to Python, it always happen when there is some kind of conflict between ascii and non ascii users. Have a nice day. jmf Roughly translated. "I've been shot to pieces and having seen Monty Python and the Holy Grail I know what to do. Run away, run away" -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat wrote: > The date stamp is different but the Python version is the same Check out what 'sys.maxunicode' is in each of those Pythons. It's possible that one is a wide build and the other narrow. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 07:09, Steven D'Aprano wrote: This is a long post. If you don't feel like reading an essay, skip to the very bottom and read my last few paragraphs, starting with "To recap". Thank you for this excellent post, it has certainly cleared up a few things for me [snip] incidentally > But in UTF-16, ... [snip] > py> s = chr(0xFFFF + 1) > py> a, b = s > py> a > '\ud800' > py> b > '\udc00' in IDLE Python 3.2.3 (default, May 3 2012, 15:51:42) [GCC 4.6.3] on linux2 Type "copyright", "credits" or "license()" for more information. No Subprocess >>> s = chr(0xFFFF + 1) >>> a, b = s Traceback (most recent call last): File "", line 1, in a, b = s ValueError: need more than 1 value to unpack At a terminal prompt [lipska@ubuntu ~]$ python3.2 Python 3.2.3 (default, Jul 17 2012, 14:23:10) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> s = chr(0xFFFF + 1) >>> a, b = s >>> a '\ud800' >>> b '\udc00' >>> The date stamp is different but the Python version is the same No idea why this is happening, I just thought it was interesting lipska -- Lipska the Kat©: Troll hunter, sandbox destroyer and farscape dreamer of Aeryn Sun -- http://mail.python.org/mailman/listinfo/python-list
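For reference, the two values seen above fall straight out of UTF-16 surrogate arithmetic: a narrow build stores a supplementary codepoint as a high/low surrogate pair, which is exactly what the two-way unpacking exposes. A sketch of the computation:

```python
def surrogate_pair(cp):
    """Split a supplementary codepoint into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)      # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits
    return chr(high), chr(low)

surrogate_pair(0x10000)   # ('\ud800', '\udc00') -- the a, b unpacked above
```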
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
On Sunday 19 August 2012 10:56:36 UTC+2, Steven D'Aprano wrote:
>
> internal implementation, and strings which fit exactly in Latin-1 will

And this is the crucial point. latin-1 is an obsolete and unusable coding scheme (esp. for European languages). We fall on the point I mentioned above. Microsoft knows this, ditto for Apple, ditto for "TeX", ditto for the foundries. Even "ISO" has recognized its error and produced iso-8859-15.

The question? Why is it still used?

jmf
--
http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:
> Steven D'Aprano wrote:
>> I don't know where people are getting this myth that PEP 393 uses
>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>> that 1-byte formats are only used for ASCII strings.
>
> From
>
> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
> 4.6.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> [sys.getsizeof("é"*i) for i in range(10)]
> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]
py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]

On re-reading the PEP more closely, it looks like I did misunderstand the internal implementation, and strings which fit exactly in Latin-1 will also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and 4-byte data. So I stand corrected.

-- Steven
--
http://mail.python.org/mailman/listinfo/python-list
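[The getsizeof figures traded above can be turned into a small probe that reports the marginal cost per character; a sketch, assuming a CPython 3.3+ interpreter where string size grows linearly with length. The helper name is ours.]

```python
import sys

def bytes_per_char(ch):
    # Marginal storage for 1000 extra copies of ch; the fixed
    # per-object overhead cancels out of the subtraction.
    return (sys.getsizeof(ch * 1001) - sys.getsizeof(ch)) // 1000

# Expected on a PEP 393 build: 1, 1, 2, 4 (ASCII, Latin-1, BMP, astral).
for ch in ("e", "\xe9", "\u20ac", "\U00010000"):
    print("U+%04X: %d byte(s)/char" % (ord(ch), bytes_per_char(ch)))
```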
Re: How do I display unicode value stored in a string variable using ord()
About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")

And it is good enough to show the problem. Period. The rest (you have to do this, you should not do this, why are you using these characters - amazing and stupid question -) does not count.

The real problem is elsewhere. *Americans* do not wish a character to occupy 4 bytes in *their* memory. The rest of the world does not count.

The same thing happens with the utf-8 coding scheme. Technically, it is fine. But after n years of usage, one should recognize it just became an ascii2. Especially for those who understand nothing in that field and are not even aware that characters are "coded". I'm the first to think this is legitimate.

Memory or "ability to treat all text in the same and equal way"?

End note. This kind of discussion is not specific to Python; it always happens when there is some kind of conflict between ascii and non ascii users.

Have a nice day.

jmf
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes:
> And of course, taking the *entire* rest of the string isn't the only
> thing you do. What if you want to take the next six characters after
> that index? That would be constant time with a fixed-width storage
> format.

How often is this an issue in practice? I wonder how other languages deal with this. The examples I can think of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type
2. Java - bogus UCS-2-like(?) representation for historical reasons. Also has some modified UTF-8 for reasons that made no sense and that I don't remember
3. Haskell - basic string type is a linked list of code points. "hello" is five list nodes. New Data.Text library (much more efficient) uses something like ropes, I think, with UTF-16 underneath.
4. Erlang - I think like Haskell. Efficiently handles byte blocks.
5. Perl 6 -- ???
6. Ruby - ??? (but probably quite slow like the rest of Ruby)
7. Objective C -- ???
8, 9 ... (any other important ones?)
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin wrote:
> Steven D'Aprano writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N)
> even with fixed-width representation, because of the char copying.
>
> if it is near the end, by knowing where the string data area
> ends, I think it should be possible to scan backwards from
> the end, recognizing what bytes can be the beginning of code points and
> counting off the appropriate number. This is O(1) if "near the end"
> means "within a constant".

Only if you know exactly where the end is (which requires storing and maintaining a character length - this may already be happening, I don't know). But that approach means you need to have code for both ways (forward search or reverse), and of course it relies on your encoding being reverse-scannable in this way (as UTF-8 is, but not all).

And of course, taking the *entire* rest of the string isn't the only thing you do. What if you want to take the next six characters after that index? That would be constant time with a fixed-width storage format.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano writes:
> result = text[end:]

if end is not near the end of the original string, then this is O(N) even with a fixed-width representation, because of the char copying.

if it is near the end, by knowing where the string data area ends, I think it should be possible to scan backwards from the end, recognizing what bytes can be the beginning of code points and counting off the appropriate number. This is O(1) if "near the end" means "within a constant".

> You could say "Screw the full Unicode standard, who needs more than 64K

No: if you're claiming the language supports unicode, it should be the whole standard.

> You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
> up to the individual programmer to track character boundaries,

I'm surprised the Python 3 implementers even considered that approach, much less went ahead with it. It's obviously wrong.

> You could add a whole lot more heavyweight infrastructure to strings,
> turn them into suped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse.
--
http://mail.python.org/mailman/listinfo/python-list
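[Paul's backwards scan works because UTF-8 continuation bytes always have the form 10xxxxxx, so lead bytes are recognizable from either direction. A sketch of the idea; the helper name is invented here.]

```python
def offset_from_end(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point counting back from the end
    of UTF-8 `data` (n=1 is the last code point).  O(n), i.e. O(1)
    when n is bounded by a constant, as Paul suggests."""
    seen = 0
    for i in range(len(data) - 1, -1, -1):
        if data[i] & 0xC0 != 0x80:      # lead byte, not a continuation
            seen += 1
            if seen == n:
                return i
    raise IndexError("fewer than n code points")

data = "héllo…".encode("utf-8")
print(data[offset_from_end(data, 2):].decode("utf-8"))   # last two chars: 'o…'
```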
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano writes:
> This is a long post. If you don't feel like reading an essay, skip to the
> very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent exposition of different Unicode encodings in response to my post. I can only hope some readers will benefit from it. I regret that I wasn't more clear about the perspective I posted from, i.e. that I'm already familiar with how those encodings work. After reading all of it, I still have the same skepticism on the main point as before, but I think I see what the issue in contention is, and some differences in perspective.

First of all, you wrote:
> This standard data structure is called UCS-2 ... There's an extension
> to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more. Unicode was historically supposed to be a 16-bit character set, but that turned out not to be enough, so the supplementary planes were added. UCS-2 thus became obsolete and UTF-16 superseded it in 1996. UTF-16 in turn is rather clumsy and the later UTF-8 is better in a lot of ways, but both of these are at least capable of encoding all the character codes.

On to the main issue:
> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
> operations are not O(1) but are O(N). That means they are slow, or buggy,
> pick one.

This I don't see. What are the basic string operations?

* Examine the first character, or first few characters ("few" = "usually bounded by a small constant") such as to parse a token from an input stream. This is O(1) with either encoding.

* Slice off the first N characters. This is O(N) with either encoding if it involves copying the chars. I guess you could share references into the same string, but if the slice reference persists while the big reference is released, you end up not freeing the memory until later than you really should.

* Concatenate two strings. O(N) either way.
* Find length of string. O(1) either way, since you'd store it in the string header when you build the string in the first place. Building the string has to have been an O(N) operation in either representation.

And finally:

* Access the nth char in the string for some large random n, or maybe get a small slice from some random place in a big string. This is where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of is that the last thing happens all that often.

Meanwhile, an example of the 393 approach failing: I was involved in a project that dealt with terabytes of OCR data of mostly English text. So the chars were mostly ascii, but there would be occasional non-ascii chars including supplementary plane characters, either because of special symbols that were really in the text, or the typical OCR confusion emitting those symbols due to printing imprecision. That's a natural for UTF-8 but the PEP-393 approach would bloat up the memory requirements by a factor of 4.

py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an error. s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart cookies and considered stuff carefully, while I'm just an internet user posting an immediate impression of something I hadn't seen before (I still use Python 2.6), but I still have to ask: if the 393 approach makes sense, why don't other languages do it?

Ropes of UTF-8 segments seems like the most obvious approach and I wonder if it was considered. By that I mean pick some implementation constant k (say k=128) and represent the string as a UTF-8 encoded byte array, accompanied by a vector of n//k pointers into the byte array, where n is the number of codepoints in the string.
Then you can reach any offset analogously to reading a random byte on a disk, by seeking to the appropriate block, and then reading the block and getting the char you want within it. Random access is then O(1) though the constant is higher than it would be with fixed width encoding. -- http://mail.python.org/mailman/listinfo/python-list
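[Paul's n//k pointer vector might look like this in outline. This is a toy sketch of the proposal, not anything CPython does; the class name and constant are invented here.]

```python
class IndexedUTF8:
    """Toy sketch of Paul's rope-ish idea: UTF-8 bytes plus one
    byte-offset per K code points, so random access costs at most
    K-1 forward steps instead of a scan from the start."""
    K = 128

    def __init__(self, s):
        self.data = s.encode("utf-8")
        self.offsets = []          # offsets[j] = byte offset of code point j*K
        count = 0
        for i, b in enumerate(self.data):
            if b & 0xC0 != 0x80:               # lead byte starts a code point
                if count % self.K == 0:
                    self.offsets.append(i)
                count += 1
        self.length = count

    def __getitem__(self, n):
        if not 0 <= n < self.length:
            raise IndexError(n)
        i = self.offsets[n // self.K]          # jump near the target...
        for _ in range(n % self.K):            # ...then walk at most K-1 chars
            i += 1
            while i < len(self.data) and self.data[i] & 0xC0 == 0x80:
                i += 1
        j = i + 1
        while j < len(self.data) and self.data[j] & 0xC0 == 0x80:
            j += 1
        return self.data[i:j].decode("utf-8")

txt = IndexedUTF8("é" * 200 + "…abc")
print(txt[199], txt[200], txt[203])    # crosses the first K=128 boundary
```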
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code.

It happens all the time. Let's say you've got a bunch of text, and you use a regex to scan through it looking for a match. Let's ignore the regular expression engine, since it has to look at every character anyway. But you've done your search and found your matching text and now want everything *after* it. That's not exactly an unusual use-case.

mo = re.search(pattern, text)
if mo:
    start, end = mo.span()
    result = text[end:]

Easy-peasy, right? But behind the scenes, you have a problem: how does Python know where text[end:] starts? With fixed-size characters, that's O(1): Python just moves forward end*width bytes into the string. Nice and fast. With variable-sized characters, Python has to start from the beginning again, and inspect each byte or pair of bytes. This turns the slice operation into O(N) and the combined op (search + slice) into O(N**2), and that starts getting *horrible*.

As always, "everything is fast for small enough N", but you *really* don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid character boundaries doesn't help you, because the string slice method cannot know where the indexes came from. I suppose you could have a "fast slice" and a "slow slice" method, but really, that sucks, and besides all that does is pass responsibility for tracking character boundaries to the developer instead of the language, and you know damn well that they will get it wrong and their code will silently do the wrong thing and they'll say that Python sucks and we never used to have this problem back in the good old days with ASCII. Boo sucks to that.
UCS-4 is an option, since that's fixed-width. But it's also bulky. For typical users, you end up wasting memory. That is the complaint driving PEP 393 -- memory is cheap, but it's not so cheap that you can afford to multiply your string memory by four just in case somebody someday gives you a character in one of the supplementary planes. If you have oodles of memory and small data sets, then UCS-4 is probably all you'll ever need. I hear that the club for people who have all the memory they'll ever need is holding their annual general meeting in a phone-booth this year. You could say "Screw the full Unicode standard, who needs more than 64K different characters anyway?" Well apart from Asians, and historians, and a bunch of other people. If you can control your data and make sure no non-BMP characters are used, UCS-2 is fine -- except Python doesn't actually use that. You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up to the individual programmer to track character boundaries, and we know how well that works. Luckily the supplementary planes are only rarely used, and people who need them tend to buy more memory and use wide builds. People who only need a few non-BMP characters in a narrow build generally just cross their fingers and hope for the best. You could add a whole lot more heavyweight infrastructure to strings, turn them into suped-up ropes-on-steroids. All those extra indexes mean that you don't save any memory. Because the objects are so much bigger and more complex, your CPU cache goes to the dogs and your code still runs slow. Which leaves us right back where we started, PEP 393. > Obviously one can concoct hypothetical examples that would suffer. If you think "slicing at arbitrary indexes" is a hypothetical example, I don't know what to say. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
>
>> "a" will be stored as 1 byte/codepoint.
>>
>> Adding "é", it will still be stored as 1 byte/codepoint.
>
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
>
> I don't know where people are getting this myth that PEP 393 uses Latin-1
> internally, it does not. Read the PEP, it explicitly states that 1-byte
> formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that (1) both ASCII and Latin1 strings require one byte per character, and (2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit system) over ASCII-only.
--
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote: > The problem with strings containing surrogate pairs is that you could > inadvertently slice the string in the middle of the surrogate pair. That's the *least* of the problems with surrogate pairs. That would be easy to fix: check the point of the slice, and back up or forward if you're on a surrogate pair. But that's not good enough, because the surrogates could be anywhere in the string. You have to touch every single character in order to know how many there are. The problem with surrogate pairs is that they make basic string operations O(N) instead of O(1). -- Steven -- http://mail.python.org/mailman/listinfo/python-list
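[The "back up or forward" fix Steven mentions is easy to sketch on an explicit list of UTF-16 code units. This is a hypothetical helper for illustration; 3.3 removes the need for it by never exposing surrogate pairs in the first place.]

```python
def safe_cut(units, i):
    """Adjust cut index i in a list of UTF-16 code units so it never
    lands between a high surrogate (D800-DBFF) and its low surrogate
    (DC00-DFFF)."""
    if 0 < i < len(units) and 0xDC00 <= units[i] <= 0xDFFF \
            and 0xD800 <= units[i - 1] <= 0xDBFF:
        return i - 1            # back up off the low surrogate
    return i

# "a" + U+10000 (stored as the pair D800 DC00) + "b", as code units:
units = [0x61, 0xD800, 0xDC00, 0x62]
print(safe_cut(units, 2))   # 2 would split the pair, so it backs up to 1
print(safe_cut(units, 3))   # 3 is already a valid boundary: unchanged
```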
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote: > "a" will be stored as 1 byte/codepoint. > > Adding "é", it will still be stored as 1 byte/codepoint. Wrong. It will be 2 bytes, just like it already is in Python 3.2. I don't know where people are getting this myth that PEP 393 uses Latin-1 internally, it does not. Read the PEP, it explicitly states that 1-byte formats are only used for ASCII strings. > Adding "€", it will still be stored as 2 bytes/codepoint. That is correct. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393: > The change does not just benefit ASCII users. It primarily benefits > anybody using a wide unicode build with strings mostly containing only > BMP characters. Just to be clear: If you have many strings which are *mostly* BMP, but have one or two non- BMP characters in *each* string, you will see no benefit. But if you have many strings which are all BMP, and only a few strings containing non-BMP characters, then you will see a big benefit. > Even for narrow build users, there is the benefit that > with approximately the same amount of memory usage in most cases, they > no longer have to worry about non-BMP characters sneaking in and > breaking their code. Yes! +1000 on that. > There is some additional benefit for Latin-1 users, but this has nothing > to do with Python. If Python is going to have the option of a 1-byte > representation (and as long as we have the flexible representation, I > can see no reason not to), The PEP explicitly states that it only uses a 1-byte format for ASCII strings, not Latin-1: "ASCII-only Unicode strings will again use only one byte per character" and later: "If the maximum character is less than 128, they use the PyASCIIObject structure" and: "The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient)." > then it is going to be Latin-1 by definition, Certainly not, either in fact or in principle. There are a large number of 1-byte encodings, Latin-1 is hardly the only one. > because that's what 1-byte Unicode (UCS-1, if you will) is. If you have > an issue with that, take it up with the designers of Unicode. The designers of Unicode have never created a standard "1-byte Unicode" or UCS-1, as far as I can determine. The Unicode standard refers to some multiple million code points, far too many to fit in a single byte. 
There is some historical justification for using "Unicode" to mean UCS-2, but with the standard being extended beyond the BMP, that is no longer valid. See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details. I think what you are trying to say is that the Unicode designers deliberately matched the Latin-1 standard for Unicode's first 256 code points. That's not the same thing though: there is no Unicode standard mapping to a single byte format. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
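[The alignment Steven describes can be checked in one line: decoding any byte via latin-1 yields the code point with the same numeric value, which is exactly why a 1-byte internal format can cover the Latin-1 range.]

```python
# Unicode's first 256 code points were deliberately matched to Latin-1:
# byte value n decodes to code point n, for every n in 0..255.
assert bytes(range(256)).decode("latin-1") == "".join(chr(i) for i in range(256))
print("latin-1 bytes map 1:1 onto code points 0-255")
```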
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote: > As I understand (I think) the undelying mechanism, I can only say, it is > not a surprise that it happens. > > Imagine an editor, I type an "a", internally the text is saved as ascii, > then I type en "é", the text can only be saved in at least latin-1. Then > I enter an "€", the text become an internal ucs-4 "string". The remove > the "€" and so on. Firstly, that is not what Python does. For starters, € is in the BMP, and so is nearly every character you're ever going to use unless you are Asian or a historian using some obscure ancient script. NONE of the examples you have shown in your emails have included 4-byte characters, they have all been ASCII or UCS-2. You are suffering from a misunderstanding about what is going on and misinterpreting what you have seen. In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. That will not change. There is a tiny amount of fixed overhead for strings, and that overhead is slightly different between the versions, but you'll never notice the difference. Secondly, how a text editor or word processor chooses to store the text that you type is not the same as how Python does it. A text editor is not going to be creating a new immutable string after every key press. That will be slow slow SLOW. The usual way is to keep a buffer for each paragraph, and add and subtract characters from the buffer. > Intuitively I expect there is some kind slow down between all these > "strings" conversion. Your intuition is wrong. Strings are not converted from ASCII to USC-2 to USC-4 on the fly, they are converted once, when the string is created. The tests we ran earlier, e.g.: ('ab…' * 1000).replace('…', 'œ…') show the *worst possible case* for the new string handling, because all we do is create new strings. 
First we create a string 'ab…', then we create another string 'ab…'*1000, then we create two new strings '…' and 'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just immediately create a new one and throw the old one away. You likely do work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
10 loops, best of 3: 2.41 usec per loop
steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
10 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of deciding whether they should be stored using 1, 2 or 4 bytes begins to fade into the noise.

> When I tested this flexible representation, a few months ago, at the
> first alpha release. This is precisely what I tested. String
> manipulations which are forcing this internal change and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable slow-down on Windows, report it as a bug.

> Does any body know a way to get the size of the internal "string" in
> bytes?

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038

As I said, there is a *tiny* overhead difference. But identifiers will generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty string.

-- Steven
--
http://mail.python.org/mailman/listinfo/python-list
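[Steven's closing suggestion, as a runnable snippet: the size of the empty string is the fixed per-object overhead, and what remains is character data. The exact overhead figure varies by version and platform.]

```python
import sys

# Size of "" = fixed object overhead; the remainder is character data.
overhead = sys.getsizeof("")
print("per-object overhead:", overhead, "bytes")
print("payload of 'abc':", sys.getsizeof("abc") - overhead, "bytes")
```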
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote: >> > I'm aware of this (and all the blah blah blah you are explaining). >> > This always the same song. Memory. >> >> >> >> Exactly. The reason it is always the same song is because it is an >> important song. >> >> > No offense here. But this is an *american* answer. I am not American. I am not aware that computers outside of the USA, and Australia, have unlimited amounts of memory. You must be very lucky. > The same story as the coding of text files, where "utf-8 == ascii" and > the rest of the world doesn't count. UTF-8 is not ASCII. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
This is a long post. If you don't feel like reading an essay, skip to the very bottom and read my last few paragraphs, starting with "To recap".

On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:
> Steven D'Aprano writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance.

Forget encodings! We're not talking about encodings. Encodings are used for converting text as bytes for transmission over the wire or storage on disk. PEP 393 talks about the internal representation of text within Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a "narrow build", text is stored using two bytes per character, so the string "len" (as in the name of the built-in function) will be stored as 006c 0065 006e (or possibly 6c00 6500 6e00, depending on whether your system is LittleEndian or BigEndian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much memory as needed. This standard data structure is called UCS-2, and it only handles characters in the Basic Multilingual Plane, the BMP (roughly the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four bytes per character, so "len" is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much memory you have, you can always use more. This system is called UCS-4, and it can handle the entire Unicode character set, for now and forever.
(If we ever need more than four bytes' worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters [technically: code points] in the Basic Multilingual Plane? There's an extension to UCS-2 called UTF-16 which extends it to the entire Unicode range. Yes, that's the same name as the UTF-16 encoding, because it's more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but characters outside the BMP by four bytes." There's a neat trick to this: the BMP doesn't use the entire two-byte range, so there are some byte pairs which are illegal in UCS-2 -- they don't correspond to *any* character. UTF-16 uses those byte pairs to signal "this is half a character, you need to look at the next pair for the rest of the character". Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".

Except this comes at a big cost: you can no longer tell how long a string is by counting the number of bytes, which is fast, because sometimes four bytes is two characters and sometimes it's one and you can't tell which it will be until you actually inspect all four bytes. Copying sub-strings now becomes either slow, or buggy.

Say you want to grab the 10th character in a string. The fast way using UCS-2 is to simply grab bytes 18 and 19 (remember characters are pairs of bytes and we start counting at zero) and you're done. Fast and safe if you're willing to give up the non-BMP characters. It's also fast and safe if you use UCS-4, but then everything takes twice as much space, so you probably end up spending so much time copying null bytes that you're probably slower anyway. Especially when your OS starts paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 18 and 19 are half of a surrogate pair, and you've now split the pair and ended up with an invalid string.
That's what Python 3.2 does, it fails to handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'

I've just split a single valid Unicode character into two invalid characters. Python 3.2 will (probably) mindlessly process those two non-characters, and the only sign I have that I did something wrong is that my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair of bytes in order to index a string, or work out its length, or copy a substring. It's not enough to just check if the last pair is a surrogate. When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the internal representation of strings -- they are either fast or correct but cannot be both. But UCS-2 is sub-optimal, because it can only handle the BMP, and
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes: > Generally, I'm working with pure ASCII, but port those same algorithms > to Python and you'll easily be able to read in a file in some known > encoding and manipulate it as Unicode. If it's pure ASCII, you can use the bytes or bytearray type. > It's not so much 'random access to the nth character' as an efficient > way of jumping forward. For instance, if I know that the next thing is > a literal string of n characters (that I don't care about), I want to > skip over that and keep parsing. I don't understand how this is supposed to work. You're going to read a large unicode text file (let's say it's UTF-8) into a single big string? So the runtime library has to scan the encoded contents to find the highest numbered codepoint (let's say it's mostly ascii but has a few characters outside the BMP), expand it all (in this case) to UCS-4 giving 4x memory bloat and requiring decoding all the UTF-8 regardless, and now we should worry about the efficiency of skipping n characters? Since you have to decode the n characters regardless, I'd think this skipping part should only be an issue if you have to do it a lot of times. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin wrote: > Chris Angelico writes: >> I don't have a Python example of parsing a huge string, but I've done >> it in other languages, and when I can depend on indexing being a cheap >> operation, I'll happily do exactly that. > > I'd be interested to know what the context was, where you parsed > a big unicode string in a way that required random access to > the nth character in the string. It's something I've done in C/C++ fairly often. Take one big fat buffer, slice it and dice it as you get the information you want out of it. I'll retain and/or calculate indices (when I'm not using pointers, but that's a different kettle of fish). Generally, I'm working with pure ASCII, but port those same algorithms to Python and you'll easily be able to read in a file in some known encoding and manipulate it as Unicode. It's not so much 'random access to the nth character' as an efficient way of jumping forward. For instance, if I know that the next thing is a literal string of n characters (that I don't care about), I want to skip over that and keep parsing. The Adobe Message Format is particularly noteworthy in this, but it's a stupid format and I don't recommend people spend too much time reading up on it (unless you like that sensation of your brain trying to escape through your ear). ChrisA -- http://mail.python.org/mailman/listinfo/python-list
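[Chris's "skip over a literal of n characters" pattern can be sketched with a toy length-prefixed format. The format and function name are invented for illustration; the point is that each skip is a single slice when indexing is O(1).]

```python
def parse_fields(s):
    """Parse a toy length-prefixed format like '5:hello3:abc':
    read a decimal length, then jump forward that many characters."""
    pos, out = 0, []
    while pos < len(s):
        colon = s.index(":", pos)
        n = int(s[pos:colon])
        out.append(s[colon + 1:colon + 1 + n])   # O(1) jump with fixed width
        pos = colon + 1 + n
    return out

print(parse_fields("5:héllo3:a…c"))   # -> ['héllo', 'a…c']
```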
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes: > Sure, four characters isn't a big deal to step through. But it still > makes indexing and slicing operations O(N) instead of O(1), plus you'd > have to zark the whole string up to where you want to work. I know some systems chop the strings into blocks of (say) a few hundred chars, so you can immediately get to the correct block, then scan into the block to get to the desired char offset. > I don't have a Python example of parsing a huge string, but I've done > it in other languages, and when I can depend on indexing being a cheap > operation, I'll happily do exactly that. I'd be interested to know what the context was, where you parsed a big unicode string in a way that required random access to the nth character in the string. -- http://mail.python.org/mailman/listinfo/python-list
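The block scheme Paul mentions can be sketched as a byte buffer plus an offset recorded every few hundred characters, so locating the n-th character scans at most one block. This is a hypothetical illustration, not any real system's implementation:

```python
class IndexedUTF8:
    """UTF-8 buffer plus a byte offset recorded every BLOCK characters,
    so char_offset(n) jumps to the right block and scans at most
    BLOCK - 1 characters instead of the whole string."""
    BLOCK = 256

    def __init__(self, text):
        self.buf = text.encode("utf-8")
        self.marks = [0]              # byte offsets of chars 0, BLOCK, 2*BLOCK, ...
        pos = 0
        for i, ch in enumerate(text):
            if i and i % self.BLOCK == 0:
                self.marks.append(pos)
            pos += len(ch.encode("utf-8"))

    def char_offset(self, n):
        """Byte offset of character n."""
        block, rem = divmod(n, self.BLOCK)
        i = self.marks[block]         # O(1) jump to the right block
        for _ in range(rem):          # then a short linear scan
            i += 1
            while i < len(self.buf) and (self.buf[i] & 0xC0) == 0x80:
                i += 1                # skip UTF-8 continuation bytes
        return i

s = IndexedUTF8("é" * 600)            # 'é' is 2 bytes in UTF-8
assert s.char_offset(300) == 600
```

The trade-off is the usual one: indexing drops from O(N) to O(BLOCK), paid for with a small side table that must be rebuilt on mutation (cheap here, since Python strings are immutable).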
Re: How do I display unicode value stored in a string variable using ord()
On 8/18/2012 4:09 PM, Terry Reedy wrote: print(timeit("c in a", "c = '…'; a = 'a'*1000+c")) # .6 in 3.2.3, 1.2 in 3.3.0 This does not make sense to me and I will ask about it. I did ask on the pydev list and paraphrased responses include: 1. 'My system gives opposite ratios.' 2. 'With the default of 1,000,000 repetitions in a loop, the reported times are microseconds per operation and thus not practically significant.' 3. 'There is a stringbench.py with a large number of such micro benchmarks.' I believe there are also whole-application benchmarks that try to mimic real-world mixtures of operations. People making improvements must consider performance on multiple systems and multiple benchmarks. If someone wants to work on search speed, they cannot just optimize that one operation on one system. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin wrote: > Chris Angelico writes: > "asdfqwer"[4:] >> 'qwer' >> >> That's a not uncommon operation when parsing strings or manipulating >> data. You'd need to completely rework your algorithms to maintain a >> position somewhere. > > Scanning 4 characters (or a few dozen, say) to peel off a token in > parsing a UTF-8 string is no big deal. It gets more expensive if you > want to index far more deeply into the string. I'm asking how often > that is done in real code. Obviously one can concoct hypothetical > examples that would suffer. Sure, four characters isn't a big deal to step through. But it still makes indexing and slicing operations O(N) instead of O(1), plus you'd have to zark the whole string up to where you want to work. It'd be workable, but you'd have to redo your algorithms significantly; I don't have a Python example of parsing a huge string, but I've done it in other languages, and when I can depend on indexing being a cheap operation, I'll happily do exactly that. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes: "asdfqwer"[4:] > 'qwer' > > That's a not uncommon operation when parsing strings or manipulating > data. You'd need to completely rework your algorithms to maintain a > position somewhere. Scanning 4 characters (or a few dozen, say) to peel off a token in parsing a UTF-8 string is no big deal. It gets more expensive if you want to index far more deeply into the string. I'm asking how often that is done in real code. Obviously one can concoct hypothetical examples that would suffer. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin wrote: > Chris Angelico writes: >> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a >> few thousand bytes, how do you locate the 273rd character? > > How often do you need to do that, as opposed to traversing the string by > iteration? Anyway, you could use a rope-like implementation, or an > index structure over the string. Well, imagine if Python strings were stored in UTF-8. How would you slice it? >>> "asdfqwer"[4:] 'qwer' That's a not uncommon operation when parsing strings or manipulating data. You'd need to completely rework your algorithms to maintain a position somewhere. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
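The rework Chris describes, "maintain a position somewhere" instead of slicing off consumed input, can be as small as this (a sketch; the class and names are mine):

```python
class Cursor:
    """Parse by advancing an offset instead of repeatedly slicing."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def skip(self, n):
        self.pos += n                # O(1) bookkeeping, no copying

    def rest(self):
        return self.text[self.pos:]  # materialize only when needed

c = Cursor("asdfqwer")
c.skip(4)
assert c.rest() == "qwer"            # same result as "asdfqwer"[4:]
```

With O(1) indexing the two styles cost about the same; under a variable-width representation the cursor style is what keeps parsing linear.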
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico writes: > UTF-8 is highly inefficient for indexing. Given a buffer of (say) a > few thousand bytes, how do you locate the 273rd character? How often do you need to do that, as opposed to traversing the string by iteration? Anyway, you could use a rope-like implementation, or an index structure over the string. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin wrote: > Can you explain the issue of "breaking surrogate pairs apart" a little > more? Switching between encodings based on the string contents seems > silly at first glance. Strings are immutable so I don't understand why > not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in > Latin-based alphabets and UTF-16 may be more efficient for some other > languages. I think even UCS-4 doesn't completely fix the surrogate pair > issue if it means the only thing I can think of. UTF-8 is highly inefficient for indexing. Given a buffer of (say) a few thousand bytes, how do you locate the 273rd character? You have to scan from the beginning. The same applies when surrogate pairs are used to represent single characters, unless the representation leaks and a surrogate is indexed as two - which is where the breaking-apart happens. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 21:22, wxjmfa...@gmail.com wrote: Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit : On Aug 18, 10:59 pm, Steven D'Aprano wrote: On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: Is there any reason why non ascii users are somehow penalized compared to ascii users? Of course there is a reason. If you want to represent 1114111 different characters in a string, as Unicode supports, you can't use a single byte per character, or even two bytes. That is a fact of basic mathematics. Supporting 1114111 characters must be more expensive than supporting 128 of them. But why should you carry the cost of 4-bytes per character just because someday you *might* need a non-BMP character? I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605 Original above does not open for me but here's a copy that does: http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html I thing it's time to leave the discussion and to go to bed. In plain English, duck out cos I'm losing. You can take the problem the way you wish, Python 3.3 is "slower" than Python 3.2. I'll ask for the second time. Provide proof that is acceptable to everybody and not just yourself. If you see the present status as an optimisation, I'm condidering this as a regression. Considering does not equate to proof. Where are the figures which back up your claim? I'm pretty sure a pure ucs-4/utf-32 can only be, by nature, the correct solution. I look forward to seeing your patch on the bug tracker. If and only if you can find something that needs patching, which from the course of this thread I think is highly unlikely. To be extreme, tools using pure utf-16 or utf-32 are, at least, considering all the citizen on this planet in the same way. jmf -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 20:40:23 UTC+2, rusi wrote: > On Aug 18, 10:59 pm, Steven D'Aprano +comp.lang.pyt...@pearwood.info> wrote: > > On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: > > > Is there any reason why non ascii users are somehow penalized compared to ascii users? > > Of course there is a reason. > > If you want to represent 1114111 different characters in a string, as Unicode supports, you can't use a single byte per character, or even two bytes. That is a fact of basic mathematics. Supporting 1114111 characters must be more expensive than supporting 128 of them. > > But why should you carry the cost of 4-bytes per character just because someday you *might* need a non-BMP character? > I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605 > Original above does not open for me but here's a copy that does: http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html I think it's time to leave the discussion and to go to bed. You can take the problem the way you wish, Python 3.3 is "slower" than Python 3.2. If you see the present status as an optimisation, I'm considering this as a regression. I'm pretty sure a pure ucs-4/utf-32 can only be, by nature, the correct solution. To be extreme, tools using pure utf-16 or utf-32 are, at least, considering all the citizens on this planet in the same way. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/18/2012 12:38 PM, wxjmfa...@gmail.com wrote: Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is always slower. Period. You have not tried enough tests ;-). On my Win7-64 system: from timeit import timeit print(timeit(" 'a'*1 ")) 3.3.0b2: .5 3.2.3: .8 print(timeit("c in a", "c = '…'; a = 'a'*1")) 3.3: .05 (independent of len(a)!) 3.2: 5.8 100 times slower! Increase len(a) and the ratio can be made as high as one wants! print(timeit("a.encode()", "a = 'a'*1000")) 3.2: 1.5 3.3: .26 Similar with encoding='utf-8' added to call. Jim, please stop the ranting. It does not help improve Python. utf-32 is not a panacea; it has problems of time, space, and system compatibility (Windows and others). Victor Stinner, whatever he may have once thought and said, put a *lot* of effort into making the new implementation both correct and fast. On your replace example >>> imeit.timeit("('ab…' * 1000).replace('…', '……')") > 61.919225272152346 >>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')") > 1.2918679017971044 I do not see the point of changing both length and replacement. For me, the time is about the same for either replacement. I do see about the same slowdown ratio for 3.3 versus 3.2 I also see it for pure search without replacement. print(timeit("c in a", "c = '…'; a = 'a'*1000+c")) # .6 in 3.2.3, 1.2 in 3.3.0 This does not make sense to me and I will ask about it. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
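The quoted timings use `timeit`; a reproducible form of the search benchmark looks like this (a sketch; a smaller `number=` than timeit's default of 1,000,000 keeps it quick, and absolute numbers vary by machine, build, and compiler flags, which is exactly why single-system micro-benchmarks are inconclusive):

```python
from timeit import timeit

# With the default number=1000000, the total in seconds reads directly
# as microseconds per operation; here we scale down for speed.
t_search  = timeit("c in a", setup="c = '…'; a = 'a'*1000 + c", number=10000)
t_replace = timeit("('ab…' * 1000).replace('…', '……')", number=1000)
assert t_search > 0 and t_replace > 0
```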
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:40, rusi wrote: On Aug 18, 10:59 pm, Steven D'Aprano wrote: On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: Is there any reason why non ascii users are somehow penalized compared to ascii users? Of course there is a reason. If you want to represent 1114111 different characters in a string, as Unicode supports, you can't use a single byte per character, or even two bytes. That is a fact of basic mathematics. Supporting 1114111 characters must be more expensive than supporting 128 of them. But why should you carry the cost of 4-bytes per character just because someday you *might* need a non-BMP character? I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605 Original above does not open for me but here's a copy that does: http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html ROFLMAO doesn't adequately sum up how much I laughed. -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:30, wxjmfa...@gmail.com wrote: Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit : On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit : [...] The problem with UCS-4 is that every character requires four bytes. [...] I'm aware of this (and all the blah blah blah you are explaining). This always the same song. Memory. Exactly. The reason it is always the same song is because it is an important song. No offense here. But this is an *american* answer. The same story as the coding of text files, where "utf-8 == ascii" and the rest of the world doesn't count. jmf Thinking about it I entirely agree with you. Steven D'Aprano strikes me as typically American, in the same way that I'm typically Brazilian :) -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:26, Paul Rubin wrote: Steven D'Aprano writes: (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters using two code points. This is fragile and doesn't work very well, because string-handling methods can break the surrogate pairs apart, leaving you with invalid unicode string. Not good.) ... With PEP 393, each Python string will be stored in the most efficient format possible: Can you explain the issue of "breaking surrogate pairs apart" a little more? Switching between encodings based on the string contents seems silly at first glance. Strings are immutable so I don't understand why not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in Latin-based alphabets and UTF-16 may be more efficient for some other languages. I think even UCS-4 doesn't completely fix the surrogate pair issue if it means the only thing I can think of. On a narrow build, codepoints outside the BMP are stored as a surrogate pair (2 codepoints). On a wide build, all codepoints can be represented without the need for surrogate pairs. The problem with strings containing surrogate pairs is that you could inadvertently slice the string in the middle of the surrogate pair. -- http://mail.python.org/mailman/listinfo/python-list
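The surrogate-pair hazard MRAB describes can be demonstrated from a wide build by looking at the UTF-16 code units directly (a sketch using the stdlib codecs, not narrow-build internals):

```python
ch = "\U0001D11E"                   # MUSICAL SYMBOL G CLEF, a non-BMP character
units = ch.encode("utf-16-be")      # how UTF-16 (and a narrow build) stores it
assert len(units) == 4              # two 16-bit code units: a surrogate pair

# Slicing between the two units leaves a lone surrogate, which is
# not a valid Unicode string:
high = units[:2]
try:
    high.decode("utf-16-be")
    broken = False
except UnicodeDecodeError:
    broken = True
assert broken
```

On a narrow build, `s[0]` on such a one-character string returns the high surrogate alone, which is precisely this invalid half.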
Re: How do I display unicode value stored in a string variable using ord()
On Aug 18, 10:59 pm, Steven D'Aprano wrote: > On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: > > Is there any reason why non ascii users are somehow penalized compared > > to ascii users? > > Of course there is a reason. > > If you want to represent 1114111 different characters in a string, as > Unicode supports, you can't use a single byte per character, or even two > bytes. That is a fact of basic mathematics. Supporting 1114111 characters > must be more expensive than supporting 128 of them. > > But why should you carry the cost of 4-bytes per character just because > someday you *might* need a non-BMP character? I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605 Original above does not open for me but here's a copy that does: http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 19:59:18 UTC+2, Steven D'Aprano wrote: > On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: > > On Saturday 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote: > > > [...] The problem with UCS-4 is that every character requires four bytes. [...] > > I'm aware of this (and all the blah blah blah you are explaining). This is always the same song. Memory. > Exactly. The reason it is always the same song is because it is an important song. No offense here. But this is an *american* answer. The same story as the coding of text files, where "utf-8 == ascii" and the rest of the world doesn't count. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 19:05, wxjmfa...@gmail.com wrote: On Saturday 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote: Proof that is acceptable to everybody please, not just yourself. I can't, I'm only facing the fact it works slower on my Windows platform. As I understand (I think) the underlying mechanism, I can only say, it is not a surprise that it happens. Imagine an editor: I type an "a", internally the text is saved as ascii, then I type an "é", the text can only be saved in at least latin-1. Then I enter an "€", the text becomes an internal ucs-4 "string". Then remove the "€" and so on. [snip] "a" will be stored as 1 byte/codepoint. Adding "é", it will still be stored as 1 byte/codepoint. Adding "€", it will then be stored as 2 bytes/codepoint. But then you wouldn't be adding them one at a time in Python, you'd be building a list and then joining them together in one operation. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano writes: > (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters > using two code points. This is fragile and doesn't work very well, > because string-handling methods can break the surrogate pairs apart, > leaving you with invalid unicode string. Not good.) ... > With PEP 393, each Python string will be stored in the most efficient > format possible: Can you explain the issue of "breaking surrogate pairs apart" a little more? Switching between encodings based on the string contents seems silly at first glance. Strings are immutable so I don't understand why not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in Latin-based alphabets and UTF-16 may be more efficient for some other languages. I think even UCS-4 doesn't completely fix the surrogate pair issue if it means the only thing I can think of. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 19:28:26 UTC+2, Mark Lawrence wrote: > Proof that is acceptable to everybody please, not just yourself. I can't, I'm only facing the fact it works slower on my Windows platform. As I understand (I think) the underlying mechanism, I can only say, it is not a surprise that it happens. Imagine an editor: I type an "a", internally the text is saved as ascii, then I type an "é", the text can only be saved in at least latin-1. Then I enter an "€", the text becomes an internal ucs-4 "string". Then remove the "€" and so on. Intuitively I expect there is some kind of slowdown from all these "string" conversions. When I tested this flexible representation a few months ago, at the first alpha release, this is precisely what I tested: string manipulations which force this internal change. I concluded the result is not brilliant. Really, a factor of 0.n up to 10. These are simply my conclusions. Related question. Does anybody know a way to get the size of the internal "string" in bytes? In the narrow or wide build it is easy, I can encode with the "unicode_internal" codec. In Py 3.3, I attempted to toy with sizeof and struct, but without success. jmf -- http://mail.python.org/mailman/listinfo/python-list
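On jmf's related question: in a PEP 393 build (CPython 3.3+), `sys.getsizeof` exposes the per-character width if you difference two sizes so the object header cancels out. A sketch (the helper name is mine):

```python
import sys

def per_char(ch, n=1000):
    """Approximate bytes per character of CPython's internal
    representation for strings made of `ch` (CPython 3.3+ only)."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n

assert per_char("a") == 1.0             # Latin-1 range: 1 byte/char
assert per_char("€") == 2.0             # BMP: 2 bytes/char
assert per_char("\U0001D11E") == 4.0    # non-BMP: 4 bytes/char
```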
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote: > Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit : >> [...] >> The problem with UCS-4 is that every character requires four bytes. >> [...] > > I'm aware of this (and all the blah blah blah you are explaining). This > always the same song. Memory. Exactly. The reason it is always the same song is because it is an important song. > Let me ask. Is Python an 'american" product for us-users or is it a tool > for everybody [*]? It is a product for everyone, which is exactly why PEP 393 is so important. PEP 393 means that users who have only a few non-BMP characters don't have to pay the cost of UCS-4 for every single string in their application, only for the ones that actually require it. PEP 393 means that using Unicode strings is now cheaper for everybody. You seem to be arguing that the way forward is not to make Unicode cheaper for everyone, but to make ASCII strings more expensive so that everyone suffers equally. I reject that idea. > Is there any reason why non ascii users are somehow penalized compared > to ascii users? Of course there is a reason. If you want to represent 1114111 different characters in a string, as Unicode supports, you can't use a single byte per character, or even two bytes. That is a fact of basic mathematics. Supporting 1114111 characters must be more expensive than supporting 128 of them. But why should you carry the cost of 4-bytes per character just because someday you *might* need a non-BMP character? > This flexible string representation is a regression (ascii users or > not). No it is not. It is a great step forward to more efficient Unicode. And it means that now Python can correctly deal with non-BMP characters without the nonsense of UTF-16 surrogates: steve@runes:~$ python3.3 -c "print(len(chr(1114000)))" # Right! 1 steve@runes:~$ python3.2 -c "print(len(chr(1114000)))" # Wrong! 2 without doubling the storage of every string. 
This is an important step towards making the full range of Unicode available more widely. > I recognize in practice the real impact is for many users closed to zero Then what's the problem? > (including me) but I have shown (I think) that this flexible > representation is, by design, not as optimal as it is supposed to be. You have not shown any real problem at all. You have shown untrustworthy, edited timing results that don't match what other people are reporting. Even if your timing results are genuine, you haven't shown that they make any difference for real code that does useful work. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 17:38, wxjmfa...@gmail.com wrote: Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is always slower. Period. Proof that is acceptable to everybody please, not just yourself. Now, the reason. I think it is due the "flexible represention". Deeper reason. The "boss" do not wish to hear from a (pure) ucs-4/utf-32 "engine" (this has been discussed I do not know how many times). jmf -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 2:38 AM, wrote: > Sorry guys, I'm not stupid (I think). I can open IDLE with > Py 3.2 ou Py 3.3 and compare strings manipulations. Py 3.3 is > always slower. Period. Ah, but what about all those other operations that use strings under the covers? As mentioned, namespace lookups do, among other things. And how is performance in the (very real) case where a C routine wants to return a value to Python as a string, where the data is currently guaranteed to be ASCII (previously using PyUnicode_FromString, now able to use PyUnicode_FromKindAndData)? Again, I'm sure this has been gone into in great detail before the PEP was accepted (am I negative-bikeshedding here? "atomic reactoring"???), and I'm sure that the gains outweigh the costs. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Sorry guys, I'm not stupid (I think). I can open IDLE with Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is always slower. Period. Now, the reason. I think it is due to the "flexible representation". Deeper reason. The "boss" does not wish to hear from a (pure) ucs-4/utf-32 "engine" (this has been discussed I do not know how many times). jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, Aug 18, 2012 at 9:07 AM, wrote: > Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit : >> [...] >> The problem with UCS-4 is that every character requires four bytes. >> [...] > > I'm aware of this (and all the blah blah blah you are > explaining). This always the same song. Memory. > > Let me ask. Is Python an 'american" product for us-users > or is it a tool for everybody [*]? > Is there any reason why non ascii users are somehow penalized > compared to ascii users? The change does not just benefit ASCII users. It primarily benefits anybody using a wide unicode build with strings mostly containing only BMP characters. Even for narrow build users, there is the benefit that with approximately the same amount of memory usage in most cases, they no longer have to worry about non-BMP characters sneaking in and breaking their code. There is some additional benefit for Latin-1 users, but this has nothing to do with Python. If Python is going to have the option of a 1-byte representation (and as long as we have the flexible representation, I can see no reason not to), then it is going to be Latin-1 by definition, because that's what 1-byte Unicode (UCS-1, if you will) is. If you have an issue with that, take it up with the designers of Unicode. > > This flexible string representation is a regression (ascii users > or not). > > I recognize in practice the real impact is for many users > closed to zero (including me) but I have shown (I think) that > this flexible representation is, by design, not as optimal > as it is supposed to be. This is in my mind the relevant point. You've shown nothing of the sort. You've demonstrated only one out of many possible benchmarks, and other users on this list can't even reproduce that. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 1:07 AM, wrote: > I'm aware of this (and all the blah blah blah you are > explaining). This always the same song. Memory. > > Let me ask. Is Python an 'american" product for us-users > or is it a tool for everybody [*]? > Is there any reason why non ascii users are somehow penalized > compared to ascii users? Regardless of your own native language, "len" is the name of a popular Python function. And "dict" is a well-used class. Both those names are representable in ASCII, even if every quoted string in your code requires more bytes to store. And memory usage has significance in many other areas, too. CPU cache utilization turns a space saving into a time saving. That's why structure packing still exists, even though member alignment has other advantages. You'd be amazed how many non-USA strings still fit inside seven bits, too. Are you appending a space to something? Splitting on newlines? You'll have lots of strings that are going now to be space-optimized. Of course, the performance gains from shortening some of the strings may be offset by costs when comparing one-byte and multi-byte strings, but presumably that's all been gone into in great detail elsewhere. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 18/08/2012 16:07, wxjmfa...@gmail.com wrote: Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit : [...] The problem with UCS-4 is that every character requires four bytes. [...] I'm aware of this (and all the blah blah blah you are explaining). This always the same song. Memory. Let me ask. Is Python an 'american" product for us-users or is it a tool for everybody [*]? Is there any reason why non ascii users are somehow penalized compared to ascii users? This flexible string representation is a regression (ascii users or not). I recognize in practice the real impact is for many users closed to zero (including me) but I have shown (I think) that this flexible representation is, by design, not as optimal as it is supposed to be. This is in my mind the relevant point. [*] This not even true, if we consider the €uro currency symbol used all around the world (banking, accounting applications). jmf Sorry but you've got me completely baffled. Could you please explain in words of one syllable or less so I can attempt to grasp what the hell you're on about? -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
(Resending this to the list because I previously sent it only to Steven by mistake. Also showing off a case where top-posting is reasonable, since this bit requires no context. :-) On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly wrote: > > On Aug 17, 2012 10:17 PM, "Steven D'Aprano" > wrote: >> >> Unicode strings are not represented as Latin-1 internally. Latin-1 is a >> byte encoding, not a unicode internal format. Perhaps you mean to say >> that they are represented as a single byte format? > > They are represented as a single-byte format that happens to be equivalent > to Latin-1, because Latin-1 is a proper subset of Unicode; every character > representable in Latin-1 has a byte value equal to its Unicode codepoint. > This talk of whether it's a byte encoding or a 1-byte Unicode representation > is then just semantics. Even the PEP refers to the 1-byte representation as > Latin-1. > >> >> >> I understand the complaint >> >> to be that while the change is great for strings that happen to fit in >> >> Latin-1, it is less efficient than previous versions for strings that >> >> do not. >> > >> > That's not the way I interpreted the PEP 393. It takes a pure unicode >> > string, finds the largest code point in that string, and chooses 1, 2 or >> > 4 bytes for every character, based on how many bits it'd take for that >> > largest code point. >> >> That's how I interpret it too. > > I don't see how this is any different from what I described. Using all 4 > bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get > UCS-2. Truncating to 1 byte, you get Latin-1. -- http://mail.python.org/mailman/listinfo/python-list
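Ian's point that the 1-byte form "is" Latin-1 by definition is easy to check: every Latin-1 byte value equals the corresponding Unicode code point, so truncating a code point to one byte loses nothing for those characters.

```python
# Latin-1 is the first 256 code points of Unicode, byte-for-byte:
assert all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))

s = "naïve café"
assert list(s.encode("latin-1")) == [ord(c) for c in s]
```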
Re: How do I display unicode value stored in a string variable using ord()
On Saturday 18 August 2012 14:27:23 UTC+2, Steven D'Aprano wrote: > [...] > The problem with UCS-4 is that every character requires four bytes. > [...] I'm aware of this (and all the blah blah blah you are explaining). This is always the same song. Memory. Let me ask. Is Python an "american" product for us-users or is it a tool for everybody [*]? Is there any reason why non ascii users are somehow penalized compared to ascii users? This flexible string representation is a regression (ascii users or not). I recognize in practice the real impact is for many users close to zero (including me) but I have shown (I think) that this flexible representation is, by design, not as optimal as it is supposed to be. This is in my mind the relevant point. [*] This is not even true, if we consider the €uro currency symbol used all around the world (banking, accounting applications). jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: sys.version > '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]' timeit.timeit("('ab…' * 1000).replace('…', '……')") > 37.32762490493721 > timeit.timeit("('ab…' * 10).replace('…', 'œ…')") 0.8158757139801764 > sys.version > '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 > bit (Intel)]' imeit.timeit("('ab…' * 1000).replace('…', '……')") > 61.919225272152346 "imeit"? It is hard to take your results seriously when you have so obviously edited your timing results, not just copied and pasted them. Here are my results, on my laptop running Debian Linux. First, testing on Python 3.2: steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')" 1 loops, best of 3: 50.2 usec per loop steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')" 1 loops, best of 3: 45.3 usec per loop steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')" 1 loops, best of 3: 51.3 usec per loop steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')" 1 loops, best of 3: 47.6 usec per loop steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')" 1 loops, best of 3: 45.9 usec per loop steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')" 1 loops, best of 3: 57.5 usec per loop steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')" 1 loops, best of 3: 49.7 usec per loop As you can see, the timing results are all consistently around 50 microseconds per loop, regardless of which characters I use, whether they are in Latin-1 or not. The differences between one test and another are not meaningful. 
Now I do them again using Python 3.3:

steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
1 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python's string handling is about 30% slower in the examples shown here.

If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug:

http://bugs.python.org/

Don't forget to report your operating system.

> My take of the subject.
>
> This is a typical Python disease. Do not solve a problem, but find a
> way, a workaround, which is expected to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes. Every. Single. One.

So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus the object overhead. Under UCS-2, it takes half that space: 22 bytes, but of course UCS-2 can only represent characters in the BMP. A pure ASCII string would only take 11 bytes, but we're not going back to pure ASCII.

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters using two code points.
This is fragile and doesn't work very well, because string-handling methods can break the surrogate pairs apart, leaving you with an invalid unicode string. Not good.)

The difference between 44 bytes and 22 bytes for one little string is not very important, but when you double the memory required for every single string it becomes huge. Remember that every class, function and method has a name, which is a string; every attribute and variable has a name, all strings; functions and classes have doc strings, all strings. Strings are used everywhere in Python, and doubling the memory needed by Python means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient format possible:

- if it only contains ASCII characters, it will be stored using 1 byte per character;

- if it only contains characters in the BMP, it will be stored using UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using UCS-4 (4 bytes per character).

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
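[Editorial note: the three storage classes described above can be checked empirically with `sys.getsizeof`. This is a sketch assuming CPython 3.3 or later; the exact byte counts are implementation details, but the per-character growth is visible.]

```python
import sys

# Per-character storage grows with the widest code point in the string
# (PEP 393, CPython 3.3+). Exact sizes are implementation details, so we
# only compare them rather than hard-coding them.
ascii_s  = 'a' * 1000           # all code points < 256 -> 1 byte/char
bmp_s    = '\u2026' * 1000      # HORIZONTAL ELLIPSIS   -> 2 bytes/char
astral_s = '\U0001F600' * 1000  # non-BMP code point    -> 4 bytes/char

sizes = [sys.getsizeof(s) for s in (ascii_s, bmp_s, astral_s)]
print(sizes)
assert sizes[0] < sizes[1] < sizes[2]
```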
Re: How do I display unicode value stored in a string variable using ord()
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit (Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044
>>> timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I intuitively and empirically noticed that this happens for cp1252 or mac-roman characters and not for characters which are elements of the latin-1 coding scheme.
* Bad luck: such characters are usual characters in French scripts (and in some other European languages).
* I do not recall the extreme cases I found. Believe me, when I speak about a few 100%, I do not lie.

My take on the subject:

This is a typical Python disease. Do not solve a problem, but find a way, a workaround, which is expected to solve a problem and which finally solves nothing. As far as I know, to break the "BMP limit", the tools are here. They are called utf-8 or ucs-4/utf-32.

One day, I came across a very, very old mail message, dating from the time of the introduction of the unicode type in Python 2. If I recall correctly, it was from Victor Stinner. He wrote something like this: "Let's go with ucs-4, and the problems are solved for ever". He was so right.

I have been watching the dev-list for years; my feeling is that there is always a latent and permanent conflict between "ascii users" and "non-ascii users" (see the unicode literal reintroduction).

Please, do not get me wrong. As a non-computer scientist, I'm very happy with Python. But if I try to take a distant eye, I become more and more sceptical.

PS Py3.3b2 is still crashing, silently exiting, with cp65001.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list
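[Editorial note: the figures above can be rerun with the `timeit` module directly. A hedged sketch follows; absolute times depend entirely on the machine and build, so only the relative comparison is meaningful, and the assertion below checks only that the benchmarked operation does what it claims.]

```python
import timeit

# Re-run the replace() micro-benchmark from the session above.
setup = "s = 'ab\u2026' * 1000"
t_ellipsis = timeit.timeit("s.replace('\u2026', '\u2026\u2026')", setup=setup, number=10000)
t_latin    = timeit.timeit("s.replace('b', 'xy')", setup=setup, number=10000)
print(t_ellipsis, t_latin)

# Sanity-check that the benchmarked operation behaves as assumed.
assert ('ab\u2026' * 2).replace('\u2026', '\u2026\u2026') == 'ab\u2026\u2026ab\u2026\u2026'
```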
Re: How do I display unicode value stored in a string variable using ord()
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
>> smallest and sufficient to contain the data.

Unicode strings are not represented as Latin-1 internally. Latin-1 is a byte encoding, not a unicode internal format. Perhaps you mean to say that they are represented as a single-byte format?

>> I understand the complaint
>> to be that while the change is great for strings that happen to fit in
>> Latin-1, it is less efficient than previous versions for strings that
>> do not.
>
> That's not the way I interpreted PEP 393. It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2 or
> 4 bytes for every character, based on how many bits it'd take for that
> largest code point.

That's how I interpret it too.

> Further I read it to mean that only 00 bytes would
> be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only about extraneous "padding" 00 bytes.

> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in Windows),
> or 4 billion or less (in real systems). So unless French has code points
> over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference. The big savings are for wide builds.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:

> On Friday, August 17, 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> On Fri, Aug 17, 2012 at 1:49 PM, wrote:
>>
>> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
>> > is one of these characters existing in the cp1252, mac-roman
>> > coding schemes and not in iso-8859-1 (latin-1) and obviously
>> > not in ascii. It causes Py3.3 to work a few 100% slower
>> > than Py<3.3 versions due to the flexible string representation
>> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
[...]
> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding, the coding of
> a Python "string" in the code source or with the display of a Python3
> .
> I wrote about the *internal* Python "coding", the way Python keeps
> "strings" in memory. See PEP 393.

The PEP does not support your claim that flexible string storage is 100% to 1000% slower. It claims a 1% - 30% slowdown, with a saving of up to 60% of the memory used for strings.

I don't really understand what message you are trying to give here. Are you saying that PEP 393 is a good thing or a bad thing?

In Python 1.x, there was no support for Unicode at all. You could only work with pure byte strings. Support for non-ascii characters like … ∞ é ñ £ π Ж ش was purely by accident: if your terminal happened to be set to an encoding that supported a character, and you happened to use the appropriate byte value, you might see the character you wanted.

In Python 2.2, Python gained support for Unicode. You could now guarantee support for any Unicode character in the Basic Multilingual Plane (BMP) by writing your strings using the u"..." style.

In Python 3, you no longer need the leading u; all strings are unicode. But there is a problem: if your Python interpreter is a "narrow build", it *only* supports Unicode characters in the BMP.
When Python is a "wide build", compiled with support for the additional character planes, then strings take much more memory, even if they are in the BMP or are simple ASCII strings. PEP 393 fixes this problem and gets rid of the distinction between narrow and wide builds. From Python 3.3 onwards, all Python builds will have the same support for unicode, rather than most being BMP-only.

Each individual string's internal storage will use only as many bytes per character as needed to store the largest character in the string. This will save a lot of memory for those using mostly ASCII or Latin-1 with only a few multibyte characters. While the increased complexity causes a small slowdown, the increased functionality makes it well worthwhile.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
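[Editorial note: the disappearance of narrow builds described above can be verified on any interpreter from 3.3 onwards; a quick sketch, assuming CPython 3.3+.]

```python
import sys

# After PEP 393 there are no narrow builds: every interpreter reports the
# full Unicode range, and a non-BMP character is a single code point
# rather than a surrogate pair.
s = '\U0001F600'                   # a code point outside the BMP
assert sys.maxunicode == 0x10FFFF  # no BMP-only builds after 3.3
assert len(s) == 1                 # would have been 2 on a 3.2 narrow build
assert ord(s) == 0x1F600
print(hex(ord(s)))
```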
Re: How do I display unicode value stored in a string variable using ord()
On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.

That's not the way I interpreted PEP 393. It takes a pure unicode string, finds the largest code point in that string, and chooses 1, 2 or 4 bytes for every character, based on how many bits it'd take for that largest code point. Further, I read it to mean that only 00 bytes would be dropped in the process; no other bytes would be changed. I take it as a coincidence that it happens to match latin-1; that's the way Unicode happened historically, and is not Python's fault. Am I reading it wrong?

I also figure this is going to be more space efficient than Python 3.2 for any string which had a max code point of 65535 or less (in Windows), or 4 billion or less (in real systems). So unless French has code points over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so I wanted a more specific complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g. class and attribute and function names.

The jmfauth rant I was responding to was saying that French isn't efficiently encoded, and that performance of some vague operations was somehow reduced by several fold. I was just trying to get him to be more specific.

-- 
DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list
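[Editorial note: Dave's reading of the PEP (scan for the largest code point, then pick 1, 2 or 4 bytes per character) can be sketched in pure Python. The function name is hypothetical; the real selection happens in C inside the interpreter.]

```python
def pep393_width(s):
    """Hypothetical sketch of the per-string width choice Dave describes:
    1, 2 or 4 bytes per character, driven by the largest code point."""
    if not s:
        return 1                  # an empty string uses the compact form
    biggest = max(ord(c) for c in s)
    if biggest < 0x100:           # fits in the Latin-1 range
        return 1
    if biggest < 0x10000:         # fits in the BMP
        return 2
    return 4                      # needs the astral planes

assert pep393_width('hello') == 1
assert pep393_width('ab\u2026') == 2        # HORIZONTAL ELLIPSIS forces 2
assert pep393_width('\U0001F600') == 4
```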
Re: How do I display unicode value stored in a string variable using ord()
On Aug 17, 2012 2:58 PM, "Dave Angel" wrote: > > The internal coding described in PEP 393 has nothing to do with latin-1 > encoding. It certainly does. PEP 393 provides for Unicode strings to be represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and sufficient to contain the data. I understand the complaint to be that while the change is great for strings that happen to fit in Latin-1, it is less efficient than previous versions for strings that do not. I don't know how much merit there is to this claim. It would seem to me that even in non-western locales, most strings are likely to be Latin-1 or even ASCII, e.g. class and attribute and function names. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 08/17/2012 02:45 PM, wxjmfa...@gmail.com wrote:
> On Friday, August 17, 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> I don't understand what any of this has to do with Python. Just
>> output your text in UTF-8 like any civilized person in the 21st
>> century, and none of that is a problem at all. Python makes that easy.
>> It also makes it easy to interoperate with older encodings if you
>> have to.
>
> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding,
> the coding of a Python "string" in the code source or with
> the display of a Python3 .
> I wrote about the *internal* Python "coding", the
> way Python keeps "strings" in memory. See PEP 393.
>
> jmf

The internal coding described in PEP 393 has nothing to do with latin-1 encoding. So what IS your point? Make it clearly, without all the snide side-comments.

-- 
DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Friday, August 17, 2012 20:21:34 UTC+2, Jerry Hill wrote:
> On Fri, Aug 17, 2012 at 1:49 PM, wrote:
>
> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> > is one of these characters existing in the cp1252, mac-roman
> > coding schemes and not in iso-8859-1 (latin-1) and obviously
> > not in ascii. It causes Py3.3 to work a few 100% slower
> > than Py<3.3 versions due to the flexible string representation
> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
> >
> > '…'.encode('cp1252')
> > b'\x85'
> > '…'.encode('mac-roman')
> > b'\xc9'
> > '…'.encode('iso-8859-1') # latin-1
> > Traceback (most recent call last):
> >   File "", line 1, in
> > UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> > in position 0: ordinal not in range(256)
> >
> > If one could neglect this (typographically important) glyph, what
> > to say about the characters of the European scripts (languages)
> > present in cp1252 or in mac-roman but not in latin-1 (eg. the
> > French script/language)?
>
> So... python should change the longstanding definition of the latin-1
> character set? This isn't some sort of python limitation, it's just
> the reality of legacy encodings that actually exist in the real world.
>
> > Very nice. Python 2 was built for ascii users, now Python 3 is
> > *optimized* for, let us say, ascii users!
> >
> > The future is bright for Python. French users are better
> > served with Apple or MS products, simply because these
> > corporates know you can not write French with iso-8859-1.
> >
> > PS When "TeX" moved from the ascii encoding to iso-8859-1
> > and the so called Cork encoding, "they" knew this and provided
> > all the complementary packages to circumvent this. It was
> > in 199? (Python was not even born).
> >
> > Ditto for the foundries (Adobe, Linotype, ...)
>
> I don't understand what any of this has to do with Python. Just
> output your text in UTF-8 like any civilized person in the 21st
> century, and none of that is a problem at all. Python makes that easy.
> It also makes it easy to interoperate with older encodings if you
> have to.

Sorry, you missed the point. My comment had nothing to do with the code source coding, the coding of a Python "string" in the code source or with the display of a Python3 . I wrote about the *internal* Python "coding", the way Python keeps "strings" in memory. See PEP 393.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Fri, Aug 17, 2012 at 1:49 PM, wrote:
> The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> is one of these characters existing in the cp1252, mac-roman
> coding schemes and not in iso-8859-1 (latin-1) and obviously
> not in ascii. It causes Py3.3 to work a few 100% slower
> than Py<3.3 versions due to the flexible string representation
> (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
>
> '…'.encode('cp1252')
> b'\x85'
> '…'.encode('mac-roman')
> b'\xc9'
> '…'.encode('iso-8859-1') # latin-1
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> in position 0: ordinal not in range(256)
>
> If one could neglect this (typographically important) glyph, what
> to say about the characters of the European scripts (languages)
> present in cp1252 or in mac-roman but not in latin-1 (eg. the
> French script/language)?

So... python should change the longstanding definition of the latin-1 character set? This isn't some sort of python limitation, it's just the reality of legacy encodings that actually exist in the real world.

> Very nice. Python 2 was built for ascii users, now Python 3 is
> *optimized* for, let us say, ascii users!
>
> The future is bright for Python. French users are better
> served with Apple or MS products, simply because these
> corporates know you can not write French with iso-8859-1.
>
> PS When "TeX" moved from the ascii encoding to iso-8859-1
> and the so called Cork encoding, "they" knew this and provided
> all the complementary packages to circumvent this. It was
> in 199? (Python was not even born).
>
> Ditto for the foundries (Adobe, Linotype, ...)

I don't understand what any of this has to do with Python. Just output your text in UTF-8 like any civilized person in the 21st century, and none of that is a problem at all. Python makes that easy. It also makes it easy to interoperate with older encodings if you have to.
-- Jerry -- http://mail.python.org/mailman/listinfo/python-list
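[Editorial note: Jerry's advice sketched in code: encode to UTF-8 at the I/O boundary and decode on the way back in. The byte values follow from the UTF-8 encoding of U+2026.]

```python
# Encode '…' (U+2026) at the I/O boundary and decode it back; the
# interpreter's internal representation never leaks into the data.
data = '\u2026'.encode('utf-8')
assert data == b'\xe2\x80\xa6'        # UTF-8 form of U+2026
text = data.decode('utf-8')
assert text == '\u2026'
assert ord(text) == 8230
print(data, ord(text))
```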
Re: How do I display unicode value stored in a string variable using ord()
On Friday, August 17, 2012 01:59:31 UTC+2, Terry Reedy wrote:
> a = '…'
> print(ord(a))
>
> >>> 8230
>
> Most things with unicode are easier in 3.x, and some are even better in
> 3.3. The current beta is good enough for most informal work. 3.3.0 will
> be out in a month.
>
> --
> Terry Jan Reedy

Slightly off topic.

The character '…', Unicode name 'HORIZONTAL ELLIPSIS', is one of these characters existing in the cp1252, mac-roman coding schemes and not in iso-8859-1 (latin-1) and obviously not in ascii. It causes Py3.3 to work a few 100% slower than Py<3.3 versions due to the flexible string representation (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

>>> '…'.encode('cp1252')
b'\x85'
>>> '…'.encode('mac-roman')
b'\xc9'
>>> '…'.encode('iso-8859-1')  # latin-1
Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position 0: ordinal not in range(256)

If one could neglect this (typographically important) glyph, what to say about the characters of the European scripts (languages) present in cp1252 or in mac-roman but not in latin-1 (eg. the French script/language)?

Very nice. Python 2 was built for ascii users, now Python 3 is *optimized* for, let us say, ascii users!

The future is bright for Python. French users are better served with Apple or MS products, simply because these corporates know you can not write French with iso-8859-1.

PS When "TeX" moved from the ascii encoding to iso-8859-1 and the so called Cork encoding, "they" knew this and provided all the complementary packages to circumvent this. It was in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Thu, 16 Aug 2012 15:09:47 -0700, Charles Jensen wrote:

> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the
> horizontal ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a
> variable?
>
> So the following 2 lines of code will give me the ascii value of the
> variable a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)

The same way you did in your original example, by defining the string as unicode:

a = u'...'
ord(a)

-- 
Keep on keepin' on.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
a = '…'
print(ord(a))

>>> 8230

Most things with unicode are easier in 3.x, and some are even better in 3.3. The current beta is good enough for most informal work. 3.3.0 will be out in a month.

-- 
Terry Jan Reedy
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 08/16/2012 06:09 PM, Charles Jensen wrote:
> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the horizontal
> ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a
> variable?
>
> So the following 2 lines of code will give me the ascii value of the variable
> a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)

You omitted the print statement. You also didn't specify what version of Python you're using; I'll assume Python 2.x, because in Python 3.x the u"xx" notation would have been a syntax error.

To get the ord of a unicode variable, you do it the same as a unicode literal:

a = u"j"  # note: for this to work reliably, you probably need the correct
          # Unicode declaration in line 2 of the file
print ord(a)

But if you have a byte string containing some binary bits, and you want to get a unicode character value out of it, you'll need to explicitly convert it to unicode. First, decide which encoding the byte string was encoded with. If you specify the wrong encoding, you're likely to get an exception, or maybe just a nonsense answer.

a = "\xc1\xc1"  # I just made this value up; it's not valid utf8
b = a.decode("utf-8")
print ord(b)

-- 
DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list
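[Editorial note: Dave's recipe translated to Python 3 terms, as a sketch using bytes that really are the cp1252 encoding of '…' quoted earlier in the thread. In Python 3 the decode step is the same, but `print` is a function and the decoded result is an ordinary `str`.]

```python
# Decode a byte string with the encoding it was actually written in,
# then take ord() of the resulting character. b'\x85' is '…' in cp1252,
# as shown earlier in the thread.
raw = b'\x85'
ch = raw.decode('cp1252')
assert ch == '\u2026'
assert ord(ch) == 8230

# The wrong encoding silently gives a different answer here: latin-1
# maps 0x85 to the C1 control character U+0085, not to '…'.
assert raw.decode('latin-1') == '\x85'
print(ord(ch))  # -> 8230
```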
Re: How do I display unicode value stored in a string variable using ord()
On Fri, Aug 17, 2012 at 8:09 AM, Charles Jensen wrote: > How would I use ord() to find the unicode value of a string stored in a > variable? > > So the following 2 lines of code will give me the ascii value of the variable > a. How do I specify ord to give me the unicode value of a? > > a = '…' > ord(a) I presume you're talking about Python 2, because in Python 3 your string variable is a Unicode string and will behave as you describe above. You'll need to look into what the encoding is, and figure it out from there. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
How do I display unicode value stored in a string variable using ord()
Everyone knows that the python command

ord(u'…')

will output the number 8230 which is the unicode character for the horizontal ellipsis.

How would I use ord() to find the unicode value of a string stored in a variable?

So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?

a = '…'
ord(a)

-- 
http://mail.python.org/mailman/listinfo/python-list