Re: [Python-Dev] Internal representation of strings and Micropython
Glenn Linderman writes:

> 3) (Most space efficient) One cached entry, that caches the last
> codepoint/byte position referenced. UTF-8 is able to be traversed in
> either direction, so next/previous codepoint access would be relatively
> fast (and such are very common operations, even when indexing notation
> is used: for ix in range(len(str_x)): func(str_x[ix]).)

Been there, tried that (Emacsen). Either it's a YAGNI (moving forward or backward over UTF-8 by characters for short distances is plenty fast, especially if you've got a lot of ASCII you can move by words for somewhat longer distances), or it's not good enough. There *may* be a sweet spot, but it's definitely smaller than the one on Sharapova's racket.

> 4) (Fixed size caches) N entries, one for the last codepoint, and
> others at Codepoint_Length/N intervals. N could be tunable.

To achieve a space saving, the cache has to be quite small, and the bigger your integers, the smaller it gets. A naive implementation on a 64-bit machine would give you 16 bytes/cache entry. Using a non-native size will be a space win, but needs care in implementation. Initializing the cache is very expensive for small strings, so you need conditional and maybe lazy initialization (for large strings).

By the way, there are also:

10) Keep counts of the leading and trailing number of ASCII (one-octet) characters. This is often a *huge* win; it's quite common to encounter documents where size - lc - tc = 2 (i.e., there's only one two-octet character in the document).

11) Keep a list (or tree) of most-recently-accessed positions.

Despite my negative experience with multibyte encodings in Emacsen, I'm persuaded by the arguments that there probably aren't all that many places in core Python where indexing is used in an essential way, so MicroPython itself can probably optimize those behind the scenes. Application programmers in the embedded context may be expected to deal with the need to avoid random-access algorithms and use iterators and generators to accomplish most tasks.
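For concreteness, here is what option 3 might look like in miniature: a toy string type carrying a single cached (codepoint index, byte offset) pair that is consulted before every index lookup, so repeated nearby accesses scan only a few bytes. All names are illustrative, not from any real implementation; a production version would live in C inside the string object.

    class Utf8Str:
        """Toy UTF-8 string with a one-entry position cache (no bounds checks)."""

        def __init__(self, s):
            self._buf = s.encode('utf-8')
            self._cache = (0, 0)  # (codepoint index, byte offset)

        def __getitem__(self, index):
            cp, byte = self._cache
            step = 1 if index >= cp else -1
            while cp != index:
                byte += step
                # Continuation bytes look like 0b10xxxxxx, so codepoint
                # boundaries can be found moving in either direction.
                while 0 < byte < len(self._buf) and self._buf[byte] & 0xC0 == 0x80:
                    byte += step
                cp += step
            self._cache = (cp, byte)   # remember where we ended up
            end = byte + 1
            while end < len(self._buf) and self._buf[end] & 0xC0 == 0x80:
                end += 1
            return self._buf[byte:end].decode('utf-8')

Sequential access like "for ix in range(len(str_x)): func(str_x[ix])" then costs O(1) amortized per lookup, because each lookup starts from the previous position.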
Re: [Python-Dev] Internal representation of strings and Micropython
On 05.06.14 03:03, Greg Ewing wrote:
> Serhiy Storchaka wrote:
>> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
>> don't use iterators. They use indices, str.find and/or regular
>> expressions. The common use case is: quickly find a substring
>> starting from the current position using str.find or re.search,
>> process the found token, advance the position, and repeat.
>
> For that kind of thing, you don't need an actual character index,
> just some way of referring to a place in a string.

Of course. But _existing_ Python interfaces all work with indices. And it is too late to change this; that train left 20 years ago. There is no need for yet another way to do string operations. One obvious way is enough.
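The scanning pattern described here looks roughly like the following (a generic sketch, not code from any of the modules named above): an explicit integer position, str.find to locate the next token, and no iterator anywhere.

    def tokens(text, sep=','):
        """Yield sep-separated fields using the index-based scanning idiom."""
        pos = 0
        while True:
            i = text.find(sep, pos)   # quickly find substring from current position
            if i < 0:
                yield text[pos:]      # last token
                return
            yield text[pos:i]         # process the found token...
            pos = i + len(sep)        # ...advance the position, and repeat

    assert list(tokens("a,b,c")) == ["a", "b", "c"]

Under a UTF-8 internal representation, keeping indexes like pos cheap is exactly what the caching proposals in this thread are about.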
Re: [Python-Dev] Internal representation of strings and Micropython
On 04.06.14 23:50, Glenn Linderman wrote:
> 3) (Most space efficient) One cached entry, that caches the last
> codepoint/byte position referenced. UTF-8 is able to be traversed in
> either direction, so next/previous codepoint access would be relatively
> fast (and such are very common operations, even when indexing notation
> is used: for ix in range(len(str_x)): func(str_x[ix]).)

Great idea! It should cover most real-world cases. Note that we can scan a UTF-8 string both left-to-right and right-to-left.
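Left-to-right and right-to-left scanning both work because UTF-8 is self-synchronizing: every continuation byte matches the pattern 10xxxxxx, so the nearest codepoint boundary can be found from any byte offset in either direction. A minimal sketch of the two primitives (illustrative only):

    def next_boundary(buf, i):
        """Byte offset of the codepoint after the one starting at offset i."""
        i += 1
        while i < len(buf) and buf[i] & 0xC0 == 0x80:  # skip 10xxxxxx bytes
            i += 1
        return i

    def prev_boundary(buf, i):
        """Byte offset of the codepoint before the one starting at offset i."""
        i -= 1
        while i > 0 and buf[i] & 0xC0 == 0x80:
            i -= 1
        return i

    buf = "héllo".encode('utf-8')     # h = 1 byte, é = 2 bytes
    assert next_boundary(buf, 1) == 3
    assert prev_boundary(buf, 3) == 1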
Re: [Python-Dev] Internal representation of strings and Micropython
Paul Sokolovsky writes:

> Please put that in perspective when alarming over O(1) indexing of an
> inherently problematic niche datatype. (Again, it's not my or
> MicroPython's fault that it was forced as the standard string type.
> Maybe if CPython seriously considered the now-standard UTF-8 encoding,
> the decision of what the str type is might be different. But CPython
> has gigabytes of heap to spare, and for MicroPython, every half-bit is
> precious.)

Would you please stop trolling? The reasons for adopting Unicode as a separate data type were good and sufficient in 2000, and they remain so today, even if you have been fortunate enough not to burn yourself on character-byte conflation yet.

What matters to you is that str (unicode) is an opaque type -- there is no specification of the internal representation in the language reference, and in fact several different ones coexist happily across existing Python implementations -- and you're free to use a UTF-8 implementation if that suits the applications you expect for MicroPython.

PEP 393 exists, of course, and specifies the current internal representation for CPython 3. But I don't see anything in it that suggests it's mandated for any other implementation.
Re: [Python-Dev] Internal representation of strings and Micropython
On 05.06.14 05:25, Terry Reedy wrote:
> I mentioned it as an alternative during the '393 discussion. I more
> than half agree that the FSR is the better choice for CPython, which
> had no particular attachment to UTF-16 in the way that I think Jython,
> for instance, does.

Yes, I remember. I think that a hybrid FSR-UTF16 (like the FSR, but with UTF-16 used instead of UCS4) is the better choice for CPython. I suppose that with emoticons and other icon characters becoming popular in the next 5 or 10 years, even English text will often contain astral characters. And spending 4 bytes per character because a long text contains one astral character looks too wasteful.
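The cost being described is easy to observe under PEP 393 as it stands; the byte counts below are from a 64-bit CPython and are approximate, but the 1-byte-versus-4-bytes jump is the point:

    import sys

    ascii_text = "x" * 1000
    astral_text = "x" * 999 + "\U0001F600"   # one emoji forces UCS-4 storage

    print(sys.getsizeof(ascii_text))    # ~1049 bytes: 1 byte per character
    print(sys.getsizeof(astral_text))   # ~4080 bytes: 4 bytes per character

A hybrid FSR-UTF16 as proposed here would store the second string at about 2 bytes per character instead, at the price of variable-width (surrogate-pair) handling for the astral character.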
Re: [Python-Dev] Internal representation of strings and Micropython
Serhiy Storchaka writes:

> Yes, I remember. I think that a hybrid FSR-UTF16 (like the FSR, but
> with UTF-16 used instead of UCS4) is the better choice for CPython. I
> suppose that with emoticons and other icon characters becoming popular
> in the next 5 or 10 years, even English text will often contain astral
> characters. And spending 4 bytes per character because a long text
> contains one astral character looks too wasteful.

Why use something that complex if you don't have to? For the use case you have in mind, just map them into private space. If you really want to be aggressive, use surrogate space too (anything that cares what a scalar represents should be trapping on non-scalars; catch that exception and look up the char -- dangerous, though, because such exceptions are probably all over the place).
[Python-Dev] Request: new Asyncio component on the bug tracker
Hi,

Would it be possible to add a new "Asyncio" component on bugs.python.org? If this component is selected, the default nosy list for asyncio would be used (guido, yury and me; there is already such a list in the nosy-list completion). Full-text search for "asyncio" returns too many results.

Victor
Re: [Python-Dev] Internal representation of strings and Micropython
Hello,

On Wed, 04 Jun 2014 22:15:30 -0400 Terry Reedy <tjre...@udel.edu> wrote:

> On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
>> "Well" is subjective (or should be defined formally based on the
>> requirements). With my MicroPython hat on, an implementation which
>> receives a string, transcodes it, leading to bigger size, just to
>> immediately transcode back and send it out, is an awful,
>> environment-unfriendly implementation ;-).
>
> I am not sure what you concretely mean by 'receive a string', but I

I (surely) mean an abstract input (as in Input/Output, aka I/O) operation.

> think you are again batting at a strawman. If you mean 'read from a
> file', and all you want to do is read bytes from and write bytes to
> external 'files', then there is obviously no need to transcode, and
> neither Python 2 nor 3 makes you do so.

But most files and network protocols are text-based, and I (and many other people) don't want to artificially use a binary data type for them, with all the attached funny things, like the b"" prefix. And then Python 2 indeed doesn't transcode anything, while Python 3 does, without being asked, and for no good purpose, because in most cases Input data will be Output as-is (maybe in byte-boundary-split chunks).

So, it all goes in rounds - ignoring the forced-Unicode problem (after a week of subscription to python-list, half of the traffic there appears to be dedicated to Unicode-related flames) on python-dev's behalf is not going to help (the Python community).

[]

-- Best regards, Paul mailto:pmis...@gmail.com
Re: [Python-Dev] Internal representation of strings and Micropython
Hello,

On Thu, 05 Jun 2014 16:54:11 +0900 Stephen J. Turnbull <step...@xemacs.org> wrote:

> Paul Sokolovsky writes:
>> Please put that in perspective when alarming over O(1) indexing of an
>> inherently problematic niche datatype. (Again, it's not my or
>> MicroPython's fault that it was forced as the standard string type.
>> Maybe if CPython seriously considered the now-standard UTF-8 encoding,
>> the decision of what the str type is might be different. But CPython
>> has gigabytes of heap to spare, and for MicroPython, every half-bit is
>> precious.)
>
> Would you please stop trolling? The reasons for adopting Unicode as a
> separate data type were good and sufficient in 2000, and they remain

If it had been kept as a separate data type, there wouldn't be any problem. But it was made the one and only string type, and all the strife started then. And there is going to be trolling as long as Python developers and decision-makers ignore (troll?) the outcry from the community (again, I was surprised and not surprised to see ~50% of the traffic on python-list touching Unicode issues).

Well, I understand the plan - hoping that people will get over this. And I'm personally happy to stay away from this trolling, but any discussion related to Unicode goes in circles and returns to the feeling that the central role Python 3 gave to Unicode is misplaced. Then for me, it's just a matter of job security and personal future - I don't want to spend the rest of my days as a javascript (or other idiotic language) monkey. And the message is clear in the air (http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ and elsewhere): if Python strings are now in Go, and in Python itself there are now Java strings, all causing strife, why not go cruising around and see what's up, instead of staying with a strong and growing community.

> so today, even if you have been fortunate enough not to burn yourself
> on character-byte conflation yet. What matters to you is that str
> (unicode) is an opaque type -- there is no specification of the
> internal representation in the language reference, and in fact several
> different ones coexist happily across existing Python implementations
> -- and you're free to use a UTF-8 implementation if that suits the
> applications you expect for MicroPython. PEP 393 exists, of course,
> and specifies the current internal representation for CPython 3. But
> I don't see anything in it that suggests it's mandated for any other
> implementation.

I knew all this before very well. What's strange is that other developers don't know, or don't treat seriously, all of the above. That's why the gentleman who kindly was interested in adding Unicode support to MicroPython started with the idea of dragging in the CPython implementation. And the only effect that persuading him it's not necessarily the best solution had, was that he started to feel he was being manipulated into writing something ugly, instead of the bright idea he had. That's why another gentleman reduces it to "O(1) string indexing, or it's not Python!". And that's why yet another gentleman, who agrees with the UTF-8 arguments, still gives an excuse (https://mail.python.org/pipermail/python-dev/2014-June/134727.html): "In this context, while a fixed-width encoding may be the correct choice it would also likely be the wrong choice."

In this regard, I'm glad to participate in a mind-resetting discussion. So, let's reiterate - there's nothing like "the best", "the only right", "the only correct", "righter than", or "more correct than" in CPython's implementation of Unicode storage. It is *arbitrary*. Well, sure, it's not arbitrary, but based on requirements, and these requirements match CPython's (implied) usage model well enough. But among all possible sets of requirements, CPython's are no more valid than any others. And another set of requirements fairly clearly leads to a situation where the CPython implementation is rejected as not correct for those requirements at all.

-- Best regards, Paul mailto:pmis...@gmail.com
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 17:54, Stephen J. Turnbull <step...@xemacs.org> wrote:
> What matters to you is that str (unicode) is an opaque type -- there
> is no specification of the internal representation in the language
> reference, and in fact several different ones coexist happily across
> existing Python implementations -- and you're free to use a UTF-8
> implementation if that suits the applications you expect for
> MicroPython.

However, as others have noted in the thread, the critical thing is to *not* let that internal implementation detail leak into the Python-level string behaviour. That's what happened with narrow builds of Python 2 and pre-PEP-393 releases of Python 3 (effectively using UTF-16 internally), and it was the cause of a sufficiently large number of bugs that the Linux distributions tend to instead accept the memory cost of using wide builds (4 bytes for all code points) for affected versions.

Preserving the "the Python 3 str type is an immutable array of code points" semantics matters significantly more than whether or not indexing by code point is O(1). The various caching tricks suggested in this thread (especially leading ASCII characters, trailing ASCII characters, and the position index of the last lookup) could keep typical lookup performance well below O(N).

> PEP 393 exists, of course, and specifies the current internal
> representation for CPython 3. But I don't see anything in it that
> suggests it's mandated for any other implementation.

CPython is constrained by C API compatibility requirements, as well as implementation constraints due to the amount of internal code that would need to be rewritten to handle a variable-width encoding as the canonical internal representation (since the problems with Python 2 narrow builds mean we already know variable-width encodings aren't handled correctly by the current code). Implementations that share code with CPython, or try to mimic the C API especially closely, may face similar restrictions. Outside that, I think we're better off if alternative implementations are free to experiment with different internal string representations.

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
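As a sketch of how the leading/trailing ASCII counts keep lookups cheap (a hypothetical helper, not from any implementation): with lc leading and tc trailing one-byte characters cached alongside a UTF-8 buffer, only indexes that land in the multibyte middle of the string ever need a scan.

    def byte_offset(buf, n_codepoints, lc, tc, index):
        """buf: UTF-8 bytes; lc/tc: cached counts of leading/trailing
        one-byte (ASCII) characters; index: codepoint index to locate."""
        if index < lc:                       # inside the ASCII prefix: O(1)
            return index
        if index >= n_codepoints - tc:       # inside the ASCII suffix: O(1)
            return len(buf) - (n_codepoints - index)
        i = lc                               # scan only the multibyte middle
        for _ in range(index - lc):
            i += 1
            while buf[i] & 0xC0 == 0x80:     # skip UTF-8 continuation bytes
                i += 1
        return i

For the common "mostly ASCII plus a couple of multibyte characters" documents mentioned earlier in the thread, the scanned region is only a handful of bytes.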
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 21:25, Paul Sokolovsky <pmis...@gmail.com> wrote:
> Well, I understand the plan - hoping that people will get over this.
> And I'm personally happy to stay away from this trolling, but any
> discussion related to Unicode goes in circles and returns to the
> feeling that the central role Python 3 gave to Unicode is misplaced.

Many of the challenges network programmers face in Python 3 are around binary data being more inconvenient to work with than it needs to be, not the fact we decentralised boundary code by offering a strict binary/text separation as the default mode of operation.

Aside from some of the POSIX locale handling issues on Linux, many of the concerns are with the usability of bytes and bytearray, not with str - that's why binary interpolation is coming back in 3.5, and there will likely be other usability tweaks for those types as well. More on that at http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Internal representation of strings and Micropython
Hello,

On Thu, 5 Jun 2014 21:43:16 +1000 Nick Coghlan <ncogh...@gmail.com> wrote:

> On 5 June 2014 21:25, Paul Sokolovsky <pmis...@gmail.com> wrote:
>> Well, I understand the plan - hoping that people will get over this.
>> And I'm personally happy to stay away from this trolling, but any
>> discussion related to Unicode goes in circles and returns to the
>> feeling that the central role Python 3 gave to Unicode is misplaced.
>
> Many of the challenges network programmers face in Python 3 are around
> binary data being more inconvenient to work with than it needs to be,
> not the fact we decentralised boundary code by offering a strict
> binary/text separation as the default mode of operation.

Just to clarify - (many) other gentlemen and I (in that order; I'm not taking the lead) don't call for going back to the Python 2 behavior, with implicit conversion between byte-oriented strings and Unicode, etc. They just point out that perhaps Python 3 went too far with Unicode by making it the default string type. Strict separation is surely mostly a good thing (I can sigh that it leads to Java-like dichotomous bloat across all the I/O classes, but well, I was able to put up with that in MicroPython already).

> Aside from some of the POSIX locale handling issues on Linux, many of
> the concerns are with the usability of bytes and bytearray, not with
> str - that's why binary interpolation is coming back in 3.5, and there
> will likely be other usability tweaks for those types as well.

All these changes are what let me dream on and speculate on the possibility that Python 4 could offer an encoding-neutral string type (which means one based on bytes), while moving unicode back to an explicit type to be used only when needed (bloated frameworks like Django can force users to it anyway, but that will be forcing at the framework level, not the language level, against which people rebel). People can dream, right?

Thanks, Paul mailto:pmis...@gmail.com
Re: [Python-Dev] Internal representation of strings and Micropython
Paul Sokolovsky <pmis...@gmail.com> wrote:
> In this regard, I'm glad to participate in a mind-resetting
> discussion. So, let's reiterate - there's nothing like "the best",
> "the only right", "the only correct", "righter than", or "more correct
> than" in CPython's implementation of Unicode storage. It is
> *arbitrary*. Well, sure, it's not arbitrary, but based on
> requirements, and these requirements match CPython's (implied) usage
> model well enough. But among all possible sets of requirements,
> CPython's are no more valid than any others. And another set of
> requirements fairly clearly leads to a situation where the CPython
> implementation is rejected as not correct for those requirements at
> all.

Several core devs have said that using UTF-8 for MicroPython is perfectly okay. I also think it's the right choice and I hope that you guys come up with a very efficient implementation.

Stefan Krah
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 22:01, Paul Sokolovsky <pmis...@gmail.com> wrote:
>> Aside from some of the POSIX locale handling issues on Linux, many of
>> the concerns are with the usability of bytes and bytearray, not with
>> str - that's why binary interpolation is coming back in 3.5, and
>> there will likely be other usability tweaks for those types as well.
>
> All these changes are what let me dream on and speculate on the
> possibility that Python 4 could offer an encoding-neutral string type
> (which means one based on bytes), while moving unicode back to an
> explicit type to be used only when needed (bloated frameworks like
> Django can force users to it anyway, but that will be forcing at the
> framework level, not the language level, against which people rebel).
> People can dream, right?

If you don't model strings as arrays of code points, or at least assume a particular universal encoding (like UTF-8), you have to give up string concatenation in order to tolerate arbitrary encodings - otherwise you end up with unintelligible data that nobody can decode because it switches encodings without notice.

That's a viable model if your OS guarantees it (Mac OS X does, for example, so Python 3 assumes UTF-8 for all OS interfaces there), but Linux currently has no such guarantee - many runtimes just decide they don't care and assume UTF-8 anyway (Python 3 may even join them some day, due to the problems caused by trusting the locale encoding to be correct, but the startup code will need non-trivial changes for that to happen - the C.UTF-8 locale may even become widespread before we get there).

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 22:01, Paul Sokolovsky <pmis...@gmail.com> wrote:
> All these changes are what let me dream on and speculate on the
> possibility that Python 4 could offer an encoding-neutral string type
> (which means one based on bytes)

To me, an "encoding-neutral string type" means roughly "characters are atomic, and the best representation we have for a character is a Unicode code point". Through any interface that provides characters, each individual character (code point) is indivisible.

To me, Python 3 has exactly an encoding-neutral string type. It also has a bytes type that is just that - bytes, which can represent anything at all. It might be the UTF-8 representation of a string, but you have the freedom to manipulate it however you like - including making it no longer valid UTF-8.

Whilst I think O(1) indexing of strings is important, I don't think it's as important as the property that characters are indivisible, and I would be quite happy for MicroPython to use UTF-8 as the underlying string representation (or some more clever thing, per several ideas in this thread) so long as:

1. It maintains a string type that presents code points as indivisible elements;

2. The performance consequences of using UTF-8 are documented, as well as any optimisations, tricks, etc. that are used to overcome those consequences (and what impact, if any, they would have if code written for MicroPython was run in CPython).

Cheers, Tim Delaney
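Both properties are observable in standard Python 3, independent of the internal representation:

    s = "a\U0001F600b"              # one astral character in the middle
    assert len(s) == 3              # code points are the atoms...
    assert s[1] == "\U0001F600"     # ...and indexing never yields half of one

    b = s.encode("utf-8")           # the bytes view is a different type
    assert len(b) == 6              # 1 + 4 + 1 bytes
    assert isinstance(b[1], int)    # indexing bytes yields raw byte values

Any MicroPython string type meeting point 1 would have to preserve exactly these invariants, however it stores the data.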
Re: [Python-Dev] Internal representation of strings and Micropython
Hello,

On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan <ncogh...@gmail.com> wrote:

[]

> problems caused by trusting the locale encoding to be correct, but
> the startup code will need non-trivial changes for that to happen -
> the C.UTF-8 locale may even become widespread before we get there).

... And until those golden times come, it would be nice if Python did not force its perfect-world model, which unfortunately is not based on the surrounding reality, and let users solve their encoding problems themselves - when they need to, because, again, one can go quite a long way without dealing with encodings at all. Whereas now Python 3 forces users to deal with encodings almost universally, while forcing a particular one on all strings (which, again, doesn't correspond to the state of the surrounding reality).

I already hear the response that it's good that users are taught to deal with encodings, that it will make them write correct programs, but that's a bit far away from the original aim of making it easy and pleasant to write correct programs. (And definitions of "correct" vary.) But all that is just an opinion.

-- Best regards, Paul mailto:pmis...@gmail.com
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 22:10, Stefan Krah <ste...@bytereef.org> wrote:
> Several core devs have said that using UTF-8 for MicroPython is
> perfectly okay. I also think it's the right choice and I hope that you
> guys come up with a very efficient implementation.

Based on this discussion, I've also posted a draft patch aimed at clarifying the relevant aspects of the data model section of the language reference (http://bugs.python.org/issue21667).

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 22:37, Paul Sokolovsky <pmis...@gmail.com> wrote:
> On Thu, 5 Jun 2014 22:20:04 +1000 Nick Coghlan <ncogh...@gmail.com> wrote:
>> problems caused by trusting the locale encoding to be correct, but
>> the startup code will need non-trivial changes for that to happen -
>> the C.UTF-8 locale may even become widespread before we get there).
>
> ... And until those golden times come, it would be nice if Python did
> not force its perfect-world model, which unfortunately is not based on
> the surrounding reality, and let users solve their encoding problems
> themselves - when they need to, because, again, one can go quite a
> long way without dealing with encodings at all. Whereas now Python 3
> forces users to deal with encodings almost universally, while forcing
> a particular one on all strings (which, again, doesn't correspond to
> the state of the surrounding reality).
>
> I already hear the response that it's good that users are taught to
> deal with encodings, that it will make them write correct programs,
> but that's a bit far away from the original aim of making it easy and
> pleasant to write correct programs. (And definitions of "correct"
> vary.)

As I've said before in other contexts: find me Windows, Mac OS X and JVM developers, or educators and scientists, that are as concerned by the text model changes as folks that are primarily focused on Linux system (including network) programming, and I'll be more willing to concede the point.

Windows, Mac OS X, and the JVM are all opinionated about the text encodings to be used at platform boundaries (UTF-16, UTF-8 and UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX) says "well, it's configurable, but we won't provide a reliable mechanism for finding out what the encoding is". So either guess as best you can based on the info the OS *does* provide, assume UTF-8, assume "some ASCII-compatible encoding", or don't do anything that requires knowing the encoding of the data being exchanged with the OS - like, say, displaying file names to users, or accepting arbitrary text as input, transforming it in a content-aware fashion, and echoing it back in a console application. None of those options are perfectly good choices.

6(ish) years ago, we chose the first option, because it has the best chance of working properly on Linux systems that use ASCII-incompatible encodings like ShiftJIS, ISO-2022, and various other East Asian codecs. For normal user-space programming, Linux is pretty reliable when it comes to ensuring the locale encoding is set to something sensible, but the price we currently pay for that decision is interoperability issues with things like daemons not receiving any configuration settings (and hence falling back to the POSIX locale) and ssh environment forwarding moving a client's encoding settings to a session on a server with different settings.

I still consider it preferable to impose inconveniences like that based on use case (situations where Linux systems don't provide sensible encoding settings) than on geographic region (locales where ASCII-incompatible encodings are likely to still be in common use). If I (or someone else) ever find the time to implement PEP 432 (or something like it) to address some of the limitations of the interpreter startup sequence that currently make it difficult to avoid relying on the POSIX locale encoding on Linux, then we'll be in a position to reassess that decision based on the increased adoption of UTF-8 by Linux distributions in recent years.

As the major community Linux distributions complete the migration of their system utilities to Python 3, we'll get to see if they decide it's better to make their locale settings more reliable, or to help make it easier for Python 3 to ignore them when they're wrong.

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Internal representation of strings and Micropython
On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal
> representation of Unicode strings. Micropython is aimed at embedded
> devices, and so minimizing memory use is important, possibly even more
> important than performance.

[...]

Wow! I'm amazed at the response here, since I expected it would have received a fairly brief "Yes" or "No" response, not this long thread. Here is a summary (as best as I am able) of a few points which I think are important:

(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying "That would be a pretty lousy option", and since nobody has really defended the suggestion, I think we can assume that it's off the table.

(2) I asked if it would be okay for µPy to use a UTF-8 implementation even though it would lead to O(N) indexing operations instead of O(1). There's been some opposition to this, including Guido's:

    "Then again the UTF-8 option would be pretty devastating too for
    anything manipulating strings (especially since many Python APIs
    are defined using indexes, e.g. the re module)."

but unless Guido wants to say different, I think the consensus is that a UTF-8 implementation is allowed, even at the cost of O(N) indexing operations. Saving memory at the cost of time -- assuming that it does save memory, which I think is an assumption and not proven -- is allowed.

(3) It seems to me that there's been a lot of theorizing about what implementation will be "obviously" more efficient. Folks, how about some benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions more suited, in my opinion, to python-ideas, or even python-list, for ways to implement O(1) indexing on top of UTF-8. Some of them involve per-string mutable state (e.g. the last index seen), or complicated int subclasses that need to know what string they come from. Remember your Zen please:

    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more efficient, but I look forward to seeing the results of benchmarks. The rationale for internal UTF-8 is that the use of any other encoding internally will be inefficient, since those strings will need to be transcoded to UTF-8 before they can be written or printed, so keeping them as UTF-8 in the first place saves the transcoding step. Well, yes, but many strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if the internal encoding of strings is more efficient than UTF-8, and most of them never need transcoding to UTF-8, a non-UTF-8 internal format might be a net win. So I'm looking forward to seeing the results of µPy's experiments with it.

Thanks to all who have commented.

-- Steven
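In the spirit of point (3), a comparison could start with something as small as this timeit harness (a sketch; the interesting numbers would come from running it under both a PEP 393 build and a UTF-8-internal build, which this snippet by itself obviously cannot do):

    import timeit

    setup = "s = 'prefix' + 'é' * 1000 + 'suffix'"
    cases = {
        "index middle": "s[500]",
        "slice+methods": "s[1:].strip().lower().center(80)",
        "encode utf-8": "s.encode('utf-8')",
    }
    for name, stmt in cases.items():
        best = min(timeit.repeat(stmt, setup, number=100000))
        print("%15s: %.3f s per 100k runs" % (name, best))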
Re: [Python-Dev] Request: new Asyncio component on the bug tracker
On Thu, 05 Jun 2014 12:03:15 +0200, Victor Stinner <victor.stin...@gmail.com> wrote:
> Would it be possible to add a new Asyncio component on
> bugs.python.org? If this component is selected, the default nosy list
> for asyncio would be used (guido, yury and me; there is already such a
> list in the nosy-list completion).

Done. There are two other people in the nosy list (Giampaolo and Antoine). If either of those wishes to be auto-nosy, let me know.

--David
Re: [Python-Dev] Internal representation of strings and Micropython
On 5 June 2014 14:15, Nick Coghlan <ncogh...@gmail.com> wrote:
> As I've said before in other contexts: find me Windows, Mac OS X and
> JVM developers, or educators and scientists, that are as concerned by
> the text model changes as folks that are primarily focused on Linux
> system (including network) programming, and I'll be more willing to
> concede the point.

There is once again a strong selection bias in this discussion, by its very nature. People who like the new model don't have anything to complain about, and so are not heard.

Just to support Nick's point, I for one find the Python 3 text model a huge benefit, both in practical terms of making my programs more robust, and educationally, as I have a far better understanding of encodings and their issues than I ever did under Python 2. Whenever a discussion like this occurs, I find it hard not to resent the people arguing that the new model should be taken away from me and replaced with a form of the old error-prone (for me) approach - as if it were in my best interests.

Internal details don't bother me - using UTF-8 and having indexing be potentially O(N) is of little relevance. But make me work with a string type that *doesn't* abstract a string as a sequence of Unicode code points and I'll get very upset.

Paul
Re: [Python-Dev] Internal representation of strings and Micropython
On Thu, Jun 5, 2014 at 11:59 AM, Paul Moore <p.f.mo...@gmail.com> wrote:
> Just to support Nick's point, I for one find the Python 3 text model a
> huge benefit, both in practical terms of making my programs more
> robust, and educationally, as I have a far better understanding of
> encodings and their issues than I ever did under Python 2.
>
> Internal details don't bother me - using UTF-8 and having indexing be
> potentially O(N) is of little relevance. But make me work with a
> string type that *doesn't* abstract a string as a sequence of Unicode
> code points and I'll get very upset.

Once you get past whether str + bytes throws an exception, which seems to be the divide most people focus on, you can discover new things like "dance-encoded" strings (bytes decoded using an incorrect encoding with the intention of transcoding them into the correct encoding later), surrogates that work perfectly until .encode(), str(bytes), APIs that disagree with you about whether the result should be str or bytes, APIs that return either str or bytes depending on their initializers, and so on. Unicode can still be complicated in Python 3, independent of any judgement about whether it is worse, better, or different than in Python 2.
Re: [Python-Dev] Internal representation of strings and Micropython
On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:
> On Wed, 04 Jun 2014 22:15:30 -0400 Terry Reedy <tjre...@udel.edu> wrote:
>> think you are again batting at a strawman. If you mean 'read from a
>> file', and all you want to do is read bytes from and write bytes to
>> external 'files', then there is obviously no need to transcode, and
>> neither Python 2 nor 3 makes you do so.
>
> But most files and network protocols are text-based, and I (and many
> other people) don't want to artificially use a binary data type for
> them, with all the attached funny things, like the b"" prefix. And
> then Python 2 indeed doesn't transcode anything, while Python 3 does,
> without being asked, and for no good purpose, because in most cases
> Input data will be Output as-is (maybe in byte-boundary-split chunks).

If all your program is doing is reading and writing data ("input data will be output as-is"), then use of binary doesn't require the b prefix, because you aren't manipulating the data. Then you have no unnecessary transcoding. If you actually wish to examine or manipulate the content as it flows by, then there are choices (option 1 is sketched just after this list):

1) If you need to examine/manipulate only a small fraction of the text data within the file, you can pay the small price of a few b prefixes to get high performance, and explicitly transcode only the portions that need to be manipulated.

2) If you are examining the bulk of the data as it flows by, but not manipulating it, just examining/extracting, then a full transcoding may be useful for that purpose... but you can perhaps do it explicitly, so that you keep the binary form for I/O. Be careful of block boundaries in this case, however.

3) If you are actually manipulating the bulk of the data, then the double transcoding (once on input, and once on output) allows you to work in units of codepoints rather than bytes, which generally makes the manipulation algorithms easier.

4) If you truly cannot afford the processor cost of the double transcoding, and need to do all your manipulations at the byte level, then you could avoid the need for the b prefix by using a preprocessor for those sections of code that are doing all-and-only bytes processing... and you'll have lots of arcane, error-prone code to write to manipulate the bytes rather than the codepoints.

On the other hand, if you can convince your data sources and sinks to deal in UTF-8, and implement a UTF-8 str in μPy, then you can both avoid transcoding and make the arcane algorithms part of the implementation of μPy rather than of the application code, and support full Unicode. And it seems to me that the world is moving that way... towards UTF-8 as the standard interchange format. Encourage it.

Glenn
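Option 1 in miniature: keep the bulk of the data as bytes, and pay for decoding only where text semantics are actually needed. The message layout below is made up for illustration:

    def content_length(raw: bytes) -> int:
        """Extract one header field without ever decoding the body."""
        head, _, _body = raw.partition(b"\r\n\r\n")   # body stays as bytes
        for line in head.split(b"\r\n"):
            if line.lower().startswith(b"content-length:"):
                return int(line.split(b":", 1)[1])    # int() accepts ASCII bytes
        raise ValueError("no Content-Length header")

    msg = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
    assert content_length(msg) == 5

Only a handful of b prefixes appear, no transcoding of the payload ever happens, and anything that genuinely is text (say, a header value to show a user) can still be decoded individually.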
Re: [Python-Dev] Internal representation of strings and Micropython
On 6/5/2014 11:41 AM, Daniel Holth wrote:
> ... you can discover new things like "dance-encoded" strings,
> surrogates that work perfectly until .encode(), str(bytes), APIs that
> disagree with you about whether the result should be str or bytes,
> APIs that return either str or bytes depending on their initializers,
> and so on. Unicode can still be complicated in Python 3, independent
> of any judgement about whether it is worse, better, or different than
> in Python 2.

Yes, people can find ways to write bad code in any language.
Re: [Python-Dev] Internal representation of strings and Micropython
On 04/06/2014 02:51, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 3:17 PM, Nick Coghlan <ncogh...@gmail.com> wrote:
>> It would. The downsides of a UTF-8 representation would be slower
>> iteration and much slower (O(N)) indexing/slicing.

There's no reason for iteration to be slower. Slicing would get O(slice offset + slice size) instead of O(slice size).

Regards

Antoine.
[Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
Hi all,

There's a very valuable optimization -- temporary elision -- which numpy can *almost* do. It gives something like a 10-30% speedup for lots of common real-world expressions. It would probably be useful for non-numpy code too. (In fact it generalizes the str += str special case that's currently hardcoded in ceval.c.) But it can't be done safely without help from the interpreter, and possibly not even then. So I thought I'd raise it here and see if we can get any consensus on whether and how CPython could support this.

=== The dream ===

Here's the idea. Take an innocuous expression like:

    result = (a + b + c) / c

This gets evaluated as:

    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are arrays with N bytes each, and N is large. For simple arithmetic like this, costs are dominated by memory access. Allocating an N-byte array requires the kernel to clear the memory, which incurs N bytes of memory traffic. If all the operands are already allocated, then performing a three-operand operation like tmp1 = a + b involves 3N bytes of memory traffic (reading the two inputs plus writing the output). In total our example does 3 allocations and has 9 operands, so it does 12N bytes of memory access.

If our arrays are small, then the kernel doesn't get involved and some of these accesses will hit the cache, but OTOH the overhead of things like malloc won't be amortized out; the best case starting from a cold cache is 3 mallocs and 6N bytes worth of cache misses (or maybe 5N if we get lucky and malloc'ing 'result' returns the same memory that tmp1 used, and it's still in cache).

There's an obvious missed optimization in this code, though, which is that it keeps allocating new temporaries and throwing away old ones. It would be better to just allocate a temporary once and re-use it:

    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1

Now we have only 1 allocation and 7 operands, so we touch only 8N bytes of memory. For large arrays -- that don't fit into cache, and for which per-op overhead is amortized out -- this gives a theoretical 33% speedup, and we can realistically get pretty close to this. For smaller arrays, the re-use of tmp1 means that in the best case we have only 1 malloc and 4N bytes worth of cache misses, and we also have a smaller cache footprint, which means this best case will be achieved more often in practice. For small arrays it's harder to estimate the total speedup, but 66% fewer mallocs and 33% fewer cache misses is certainly enough to make a practical difference.

Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare "num_update" to "num_inplace"):

    https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update. With our prototype automatic temporary elision turned on, this difference disappears -- the natural code gets 10-15% faster, *and* we remove the temptation to write horrible things like num_inplace.

What do I mean by "automatic temporary elision"? It's *almost* possible for numpy to automatically convert the first example into the second.
The idea is: we want to replace

    tmp2 = tmp1 + c

with

    tmp1 += c
    tmp2 = tmp1

And we can do this by defining

    def __add__(self, other):
        if is_about_to_be_thrown_away(self):
            return self.__iadd__(other)
        else:
            ...

Now tmp1.__add__(c) does an in-place add and returns tmp1, no allocation occurs, woohoo. The only little problem is implementing is_about_to_be_thrown_away().

=== The sneaky-but-flawed approach ===

The following implementation may make you cringe, but it comes tantalizingly close to working:

    bool
    is_about_to_be_thrown_away(PyObject *obj)
    {
        return (Py_REFCNT(obj) == 1);
    }

In fact, AFAICT it's 100% correct for libraries being called by regular python code (which is why I'm able to quote benchmarks at you :-)). The bytecode eval loop always holds a reference to all operands, and then immediately DECREFs them after the operation completes. If one of our arguments has no other references besides this one, then we can be sure that it is a dead obj walking, and steal its corpse.

But this has a fatal flaw: people are unreasonable creatures, and sometimes they call Python libraries without going through ceval.c :-(. It's legal for random C code to hold an array object with a single reference count, and then call PyNumber_Add on it, and then expect the original array object to still be valid. But who writes code like that in practice? Well, Cython does. So, this is no-go.
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On 5 June 2014 21:51, Nathaniel Smith <n...@pobox.com> wrote:
> Is there a better idea I'm missing?

Just a thought, but the temporaries come from the stack manipulation done by the likes of the BINARY_ADD opcode. (After all, the bytecode doesn't use temporaries; it's a stack machine.) Maybe BINARY_ADD and friends could allow for an alternative fast calling convention for __add__ implementations that uses the stack slots directly?

This may be something that's only plausible from C code, though. Or may not be plausible at all. I haven't looked at ceval.c for many years... If this is an insane idea, please feel free to ignore me :-)

Paul
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On Thu, Jun 5, 2014 at 10:37 PM, Paul Moore <p.f.mo...@gmail.com> wrote:
> Just a thought, but the temporaries come from the stack manipulation
> done by the likes of the BINARY_ADD opcode. (After all, the bytecode
> doesn't use temporaries; it's a stack machine.) Maybe BINARY_ADD and
> friends could allow for an alternative fast calling convention for
> __add__ implementations that uses the stack slots directly?
>
> This may be something that's only plausible from C code, though. Or
> may not be plausible at all. I haven't looked at ceval.c for many
> years... If this is an insane idea, please feel free to ignore me :-)

To make sure I understand correctly, you're suggesting something like adding a new set of special method slots, __te_add__, __te_mul__, etc., which BINARY_ADD and friends would check for and, if found, dispatch to without going through PyNumber_Add? And this way, a type like numpy's array could have a special implementation for __te_add__ that works the same as __add__, except with the added wrinkle that it knows it will only be called by the interpreter, and thus any arguments with refcnt 1 must be temporaries?

-n

-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On 5 June 2014 22:47, Nathaniel Smith <n...@pobox.com> wrote:
> To make sure I understand correctly, you're suggesting something like
> adding a new set of special method slots, __te_add__, __te_mul__, etc.

I wasn't thinking in that much detail, TBH. I'm not sure adding a whole set of new slots is sensible for such a specialised case. I think I was more assuming that the special method implementations could use an alternative calling convention - METH_STACK in place of METH_VARARGS, for example. That would likely only be viable for types implemented in C. But either way, it may be more complicated than the advantages would justify...

Paul
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On 6/5/2014 4:51 PM, Nathaniel Smith wrote:
> In fact, AFAICT it's 100% correct for libraries being called by
> regular python code (which is why I'm able to quote benchmarks at you
> :-)). The bytecode eval loop always holds a reference to all operands,
> and then immediately DECREFs them after the operation completes. If
> one of our arguments has no other references besides this one, then we
> can be sure that it is a dead obj walking, and steal its corpse.
>
> But this has a fatal flaw: people are unreasonable creatures, and
> sometimes they call Python libraries without going through ceval.c
> :-(. It's legal for random C code to hold an array object with a
> single reference count, and then call PyNumber_Add on it, and then
> expect the original array object to still be valid. But who writes
> code like that in practice? Well, Cython does. So, this is no-go.

I understand that a lot of numpy/scipy code is compiled with Cython, so you really want the optimization to continue working when so compiled. Is there a simple change to Cython that would work, perhaps in coordination with a change to numpy? If so, you could get the result before 3.5 comes out.

I realize that there are compilers other than Cython, and non-numpy code, that could benefit, so a more generic solution would also be good. In particular:

> Here's the idea. Take an innocuous expression like:
>
>     result = (a + b + c) / c
>
> This gets evaluated as:
>
>     tmp1 = a + b
>     tmp2 = tmp1 + c
>     result = tmp2 / c
>
> ... There's an obvious missed optimization in this code, though, which
> is that it keeps allocating new temporaries and throwing away old
> ones. It would be better to just allocate a temporary once and re-use
> it:
>
>     tmp1 = a + b
>     tmp1 += c
>     tmp1 /= c
>     result = tmp1

Could this transformation be done in the ast? And would that help?

A prolonged discussion might be better on python-ideas. See what others say.

-- Terry Jan Reedy
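On the ast question: the rewrite is expressible as a purely syntactic transform, but it is not semantics-preserving in general, because it assumes every __add__ returns a fresh object (true for numpy arrays, not guaranteed by the language) -- which is exactly why the thread looks at runtime refcounts instead. A toy transformer for a chain of + on a simple assignment (ast.unparse is used for display and needs a modern CPython):

    import ast

    def elide_add_temporaries(src):
        """Rewrite `x = a + b + c + d` into `x = a + b; x += c; x += d`."""
        tree = ast.parse(src)
        out = []
        for stmt in tree.body:
            if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)
                    and isinstance(stmt.value, ast.BinOp)
                    and isinstance(stmt.value.op, ast.Add)):
                rights, node = [], stmt.value
                while isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
                    rights.append(node.right)   # peel the left-leaning + spine
                    node = node.left
                rights.reverse()                # node is now the leftmost operand
                name = stmt.targets[0].id
                out.append(ast.Assign(
                    targets=[ast.Name(name, ctx=ast.Store())],
                    value=ast.BinOp(node, ast.Add(), rights[0])))
                for r in rights[1:]:            # reuse the target as the temporary
                    out.append(ast.AugAssign(
                        target=ast.Name(name, ctx=ast.Store()),
                        op=ast.Add(), value=r))
            else:
                out.append(stmt)
        mod = ast.fix_missing_locations(ast.Module(out, type_ignores=[]))
        return ast.unparse(mod)

    print(elide_add_temporaries("tmp = a + b + c"))
    # tmp = a + b
    # tmp += c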
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On Thu, Jun 5, 2014 at 11:12 PM, Paul Moore <p.f.mo...@gmail.com> wrote:
> I think I was more assuming that the special method implementations
> could use an alternative calling convention - METH_STACK in place of
> METH_VARARGS, for example. That would likely only be viable for types
> implemented in C.

Oh, I see, that's clever. But unfortunately most __special__ methods at the C level don't use METH_* at all; they just have hard-coded calling conventions: https://docs.python.org/3/c-api/typeobj.html#number-structs

-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Re: [Python-Dev] Internal representation of strings and Micropython
On 6 Jun 2014 05:13, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> On 6/5/2014 11:41 AM, Daniel Holth wrote:
>> ... Unicode can still be complicated in Python 3, independent of any
>> judgement about whether it is worse, better, or different than in
>> Python 2.
>
> Yes, people can find ways to write bad code in any language.

Note that several of the issues Daniel mentions here are due to the lack of reliable encoding settings on Linux and the challenges of the Py2-3 migration, rather than users writing bad code. Several of them represent bugs to be fixed, or serve as indicators of missing features that would make it easier to work around an imperfect world.

Cheers, Nick.
Re: [Python-Dev] Internal representation of strings and Micropython
Steven D'Aprano wrote: (1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying That would be a pretty lousy option,

It would be limiting to have this as the *only* way of dealing with Unicode, but I don't see anything wrong with having this available as an option for applications that truly don't need anything more than ASCII. There must be plenty of those; the controller that runs my car engine, for example, doesn't exchange text with the outside world at all.

The rationale of internal UTF-8 is that the use of any other encoding internally will be inefficient since those strings will need to be transcoded to UTF-8 before they can be written or printed,

No, I think the rationale is that UTF-8 is likely to use less memory than UTF-16 or UTF-32.

-- Greg
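Greg's memory point is easy to check in CPython itself; the exact getsizeof figure varies by version and platform, but the encoded payload sizes do not:

    import sys

    s = "MicroPython" * 10                 # 110 pure-ASCII characters
    print(sys.getsizeof(s))                # compact one-byte-per-char storage plus header
    print(len(s.encode("utf-8")))          # 110 bytes of payload
    print(len(s.encode("utf-16-le")))      # 220 bytes
    print(len(s.encode("utf-32-le")))      # 440 bytes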
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
Nathaniel Smith n...@pobox.com writes: Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare num_update to num_inplace): https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update.

Does it really have to be that ugly? Shouldn't using

    tmp += u[2:,1:-1]
    tmp *= dy2

instead of

    np.add(tmp, u[2:,1:-1], out=tmp)
    np.multiply(tmp, dy2, out=tmp)

give the same performance? (Yes, not as nice as what you're proposing, but I'm still curious.)

Best, -Nikolaus
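A quick way to convince yourself that the two spellings Nikolaus compares touch the same buffer, with no new allocation either way; the address check via __array_interface__ is just for demonstration:

    import numpy as np

    tmp = np.arange(5.0)
    x = np.ones(5)

    before = tmp.__array_interface__["data"][0]   # address of tmp's buffer
    tmp += x                                      # in-place operator spelling
    np.add(tmp, x, out=tmp)                       # explicit out= spelling
    after = tmp.__array_interface__["data"][0]

    assert before == after                        # same buffer throughout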
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On 6 Jun 2014 02:16, Nikolaus Rath nikol...@rath.org wrote: Nathaniel Smith n...@pobox.com writes: Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare num_update to num_inplace): https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update.

Does it really have to be that ugly? Shouldn't using

    tmp += u[2:,1:-1]
    tmp *= dy2

instead of

    np.add(tmp, u[2:,1:-1], out=tmp)
    np.multiply(tmp, dy2, out=tmp)

give the same performance? (yes, not as nice as what you're proposing, but I'm still curious).

Yes, only the last line actually requires the out= syntax; everything else could use in-place operators instead (and automatic temporary elision wouldn't work for the last line anyway). I guess whoever wrote it did it that way for consistency (and perhaps in hopes of eking out a tiny bit more speed - in numpy currently the in-place operators are implemented by dispatching to function calls like those). Not sure how much difference it really makes in practice, though. It'd still be 8 statements and two named temporaries to do the work of one infix expression, with order of operations implicit.

-n
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On 5 Jun 2014 23:58, Terry Reedy tjre...@udel.edu wrote: On 6/5/2014 4:51 PM, Nathaniel Smith wrote: In fact, AFAICT it's 100% correct for libraries being called by regular Python code (which is why I'm able to quote benchmarks at you :-)). The bytecode eval loop always holds a reference to all operands, and then immediately DECREFs them after the operation completes. If one of our arguments has no other references besides this one, then we can be sure that it is a dead obj walking, and steal its corpse. But this has a fatal flaw: people are unreasonable creatures, and sometimes they call Python libraries without going through ceval.c :-(. It's legal for random C code to hold an array object with a single reference count, and then call PyNumber_Add on it, and then expect the original array object to still be valid. But who writes code like that in practice? Well, Cython does. So, this is a no-go.

I understand that a lot of numpy/scipy code is compiled with Cython, so you really want the optimization to continue working when so compiled. Is there a simple change to Cython that would work, perhaps in coordination with a change to numpy? If so, you could get the result before 3.5 comes out.

Unfortunately we don't actually know whether Cython is the only culprit (such code *could* be written by hand), and even if we fixed Cython it would take some unknowable amount of time before all downstream users upgraded their Cythons. (It's pretty common for projects to check in Cython-generated .c files, and only regenerate when the Cython source actually gets modified.) Pretty risky for an optimization.

I realized that there are other compilers than Cython and non-numpy code that could benefit, so a more generic solution would also be good. In particular: Here's the idea. Take an innocuous expression like:

    result = (a + b + c) / c

This gets evaluated as:

    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c

... There's an obvious missed optimization in this code, though, which is that it keeps allocating new temporaries and throwing away old ones. It would be better to just allocate a temporary once and re-use it:

    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1

Could this transformation be done in the ast? And would that help?

I don't think it could be done in the ast, because I don't think you can work with anonymous temporaries there. But, now that you mention it, it could be done on the fly in the implementation of the relevant opcodes. I.e., BIN_ADD could do:

    if (Py_REFCNT(left) == 1)
        result = PyNumber_InPlaceAdd(left, right);
    else
        result = PyNumber_Add(left, right);

Upside: all packages automagically benefit!

Potential downsides to consider:

- Subtle but real and user-visible change in Python semantics. I'd be a little nervous about whether anyone has implemented, say, an iadd with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed. Maybe this doesn't make sense though.

- Only works when the left operand is the temporary (remember that a*b+c is faster than c+a*b), and only for arithmetic (no benefit for np.sin(a + b)). Probably does cover the majority of cases, though.

A prolonged discussion might be better on python-ideas. See what others say.

Yeah, I wasn't sure which list to use for this one; happy to move if it would work better.

-n
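Nathaniel's refcount test can be mimicked, imperfectly, in pure Python: sys.getrefcount reports one extra reference for its own argument, so a threshold of 2 plays the role of his Py_REFCNT(left) == 1. This is a sketch of the idea only -- the exact threshold depends on how many transient references the call machinery adds, and a real implementation would live inside the eval loop, not in a Python function:

    import sys
    import numpy as np

    TEMP_THRESHOLD = 2   # assumption: one frame-local ref + getrefcount's own ref

    def elided_add(left, right):
        # Sketch: if 'left' looks like a dying temporary, reuse its buffer.
        if isinstance(left, np.ndarray) and sys.getrefcount(left) <= TEMP_THRESHOLD:
            left += right          # steal the corpse
            return left
        return left + right        # someone else can still see 'left': allocate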
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On Fri, Jun 6, 2014 at 11:47 AM, Nathaniel Smith n...@pobox.com wrote: Unfortunately we don't actually know whether Cython is the only culprit (such code *could* be written by hand), and even if we fixed Cython it would take some unknowable amount of time before all downstream users upgraded their Cythons. (It's pretty common for projects to check in Cython-generated .c files, and only regenerate when the Cython source actually gets modified.) Pretty risky for an optimization.

But the code will still work, right? I mean, you miss out on an optimization, but it won't actually be wrong code? It should be possible to say "After upgrading to Cython version x.y, regenerate all your .c files to take advantage of this new optimization."

ChrisA
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
Nathaniel Smith wrote: I.e., BIN_ADD could do:

    if (Py_REFCNT(left) == 1)
        result = PyNumber_InPlaceAdd(left, right);
    else
        result = PyNumber_Add(left, right);

Upside: all packages automagically benefit! Potential downsides to consider: - Subtle but real and user-visible change in Python semantics.

That would be a real worry. Even if such cases were rare, they'd be damnably difficult to debug when they did occur. I think for safety's sake this should only be done if the type concerned opts in somehow, perhaps by a tp_flag indicating that the type is eligible for temporary elision.

-- Greg
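At the Python level, the opt-in Greg suggests might look something like a registry consulted before the refcount trick fires; the names here are invented for illustration, and a real version would be a tp_flag checked in C inside the eval loop:

    import sys

    TEMP_THRESHOLD = 2            # see the caveat on the earlier sketch
    ELISION_OK_TYPES = set()      # types that have declared temporary reuse safe

    def elided_add(left, right):
        if type(left) in ELISION_OK_TYPES and sys.getrefcount(left) <= TEMP_THRESHOLD:
            left += right         # reuse the dying temporary
            return left
        return left + right       # default: allocate a fresh result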
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
On 05/06/14 22:51, Nathaniel Smith wrote: This gets evaluated as:

    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are arrays with N bytes each, and N is large. For simple arithmetic like this, the costs are dominated by memory access. Allocating an N-byte array requires the kernel to clear the memory, which incurs N bytes of memory traffic. It seems to be the case that a large portion of the run-time in Python code using NumPy can be spent in the kernel zeroing pages (which the kernel does for security reasons).

I think this can also be seen as a 'malloc problem'. It comes about because each new NumPy array starts with a fresh buffer allocated by malloc. Perhaps buffers can be reused?

Sturla
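One way to read Sturla's suggestion is a small free list keyed by shape and dtype, so the allocator asks the kernel for freshly zeroed pages less often; every name here is invented for illustration:

    import numpy as np

    _free_list = {}   # (shape, dtype) -> list of spare buffers

    def acquire(shape, dtype=np.float64):
        key = (tuple(shape), np.dtype(dtype))
        spares = _free_list.get(key)
        if spares:
            return spares.pop()          # recycled: no fresh pages to zero
        return np.empty(shape, dtype)    # cold path: a brand-new buffer

    def release(arr):
        _free_list.setdefault((arr.shape, arr.dtype), []).append(arr)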
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
Nathaniel Smith wrote: I'd be a little nervous about whether anyone has implemented, say, an iadd with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed.

I can think of at least one plausible scenario where this could occur: the operand is a view object that wraps another object, and its __iadd__ method updates that other object. In fact, now that I think about it, exactly this kind of thing happens in numpy when you slice an array! So the opt-in indicator would need to be dynamic, on a per-object basis, rather than a type flag.

-- Greg
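The numpy slice case Greg mentions is easy to demonstrate; a view that looks like a disposable temporary still has perfectly visible side effects:

    import numpy as np

    a = np.zeros(4)
    v = a[1:3]      # a view sharing a's buffer, not a copy
    v += 1          # __iadd__ on the view writes through to a
    print(a)        # [ 0.  1.  1.  0.]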
Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
Nathaniel Smith n...@pobox.com writes:

    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1

Could this transformation be done in the ast? And would that help?

I don't think it could be done in the ast because I don't think you can work with anonymous temporaries there. But, now that you mention it, it could be done on the fly in the implementation of the relevant opcodes. I.e., BIN_ADD could do:

    if (Py_REFCNT(left) == 1)
        result = PyNumber_InPlaceAdd(left, right);
    else
        result = PyNumber_Add(left, right);

Upside: all packages automagically benefit! Potential downsides to consider: - Subtle but real and user-visible change in Python semantics. I'd be a little nervous about whether anyone has implemented, say, an iadd with side effects such that you can tell whether a copy was made, even if the object being copied is immediately destroyed. Maybe this doesn't make sense though.

Hmm. I don't think this is as unlikely as it may sound. Consider e.g. the h5py module:

    with h5py.File('database.h5') as fh:
        result = fh['key'] + np.ones(42)

If this were transformed to

    with h5py.File('database.h5') as fh:
        tmp = fh['key']
        tmp += np.ones(42)
        result = tmp

then the database.h5 file would get modified, *and* result would be of type h5py.Dataset rather than np.array.

Best, -Nikolaus
Re: [Python-Dev] Internal representation of strings and Micropython
Paul Sokolovsky wrote: All these changes are what let me dream on and speculate on possibility that Python4 could offer an encoding-neutral string type (which means based on bytes)

Can you elaborate on exactly what you have in mind? You seem to want something different from Python 3 str, Python 3 bytes and Python 2 str, but it's far from clear what you want this type to be like.

-- Greg
[Python-Dev] Internal representation of strings and Micropython (Steven D'Aprano's summary)
Steven D'Aprano wrote: (1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying That would be a pretty lousy option, and since nobody has really defended the suggestion, I think we can assume that it's off the table.

Lousy is not quite the same as forbidden. Doing it in good faith would require making the limit prominent in the documentation, and raising some sort of CharacterNotSupported exception (or at least a warning) whenever there is an attempt to create a non-ASCII string, even via the C API.

(2) I asked if it would be okay ... to use a UTF-8 implementation even though it would lead to O(N) indexing operations instead of O(1). There's been some opposition to this, including Guido's: It is bad when quirks -- even good quirks -- of one implementation lead people to write code that will perform badly on a different Python implementation.

CPython has at least delayed obvious optimizations for this reason. Changing idiomatic operations from O(1) to O(N) is big enough to cause a concern. That said, the target environment itself apparently limits N to values small enough that the problem should be mostly theoretical. If you want to be good citizens, then do put a note in the documentation warning that particularly long strings are likely to cause performance issues unique to the MicroPython implementation. (Frankly, my personal opinion is that if you're really optimizing for space, then long strings will start getting awkward long before N is big enough for algorithmic complexity to overcome constant factors.)

... those strings will need to be transcoded to UTF-8 before they can be written or printed, so keeping them as UTF-8 ...

That all assumes that the external world is using UTF-8 anyhow. Which is more likely to be true if you document it as a limitation of MicroPython.

... but many strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. But looking at the actual strings -- UTF-8 doesn't really hurt much. Only the slice and center() are more complex, and for a string less than 80 characters long, O(N) is irrelevant.

-jJ
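To make the O(N) indexing cost concrete, here is a pure-Python sketch of codepoint indexing into a UTF-8 buffer (MicroPython would of course do this in C, and the function name is invented for illustration):

    def utf8_index(buf, i):
        # Walk i codepoints from the start: O(N) in the index, unlike the
        # O(1) array lookup a fixed-width representation gives you.
        pos = 0
        for _ in range(i):
            pos += 1
            while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
                pos += 1                 # skip UTF-8 continuation bytes
        end = pos + 1
        while end < len(buf) and (buf[end] & 0xC0) == 0x80:
            end += 1
        return buf[pos:end].decode("utf-8")

    assert utf8_index("héllo".encode("utf-8"), 1) == "é"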