Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, 11 Jun 2014 08:29:06 +1000, Tim Delaney wrote: On 11 June 2014 05:43, alister alister.nospam.w...@ntlworld.com wrote: Your error reports always seem to revolve around benchmarks, despite speed not being one of Python's prime objectives. By his own admission, jmf doesn't use Python anymore. His only reason to remain on this emailing/newsgroup is to troll about the FSR. Please don't reply to him (and preferably add him to your killfile). I couldn't killfile JMF; I find his posts useful. Every time I find myself agreeing with him, I know I have got it wrong. -- The nice thing about Windows is - It does not just crash, it displays a dialog box and lets you press 'OK' first. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
alister alister.nospam.w...@ntlworld.com writes: On Wed, 11 Jun 2014 08:29:06 +1000, Tim Delaney wrote: By his own admission, jmf doesn't use Python anymore. His only reason to remain on this emailing/newsgroup is to troll about the FSR. Please don't reply to him (and preferably add him to your killfile). I couldn't kill file JMF I find his posts useful That's fine, kill-filing his posts is a matter that affects only you. But please do not reply to them, nor taunt him in unrelated posts; it disrupts this forum. Instead, give him no reason to think anyone is interested. -- \ “Too many pieces of music finish too long after the end.” —Igor | `\ Stravinskey | _o__) | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 06/10/2014 01:43 PM, alister wrote: On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote: BTW, very easy to explain. Yeah he keeps saying that, but he never does explain--just flails around and mumbles unicode.org. Guess everyone has to have his or her windmill to tilt at. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote: On Saturday, 7 June 2014 04:20:22 UTC+2, Tim Chase wrote: On 2014-06-06 09:59, Travis Griggs wrote: On Jun 4, 2014, at 4:01 AM, Tim Chase wrote: If you use UTF-8 for everything It seems to me, that increasingly other libraries (C, etc), use utf8 as the preferred string interchange format. I definitely advocate UTF-8 for any streaming scenario, as you're iterating unidirectionally over the data anyways, so why use/transmit more bytes than needed. The only failing of UTF-8 that I've found in the real world(*) is when you have the requirement of constant-time indexing into strings. -tkc And once again, just an illustration,
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
[0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
[2.5541921791045183, 2.52434366066052, 2.5337417948967413]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
[0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
>>> sys.getsizeof('abc'*1000 + '\u0fce')
6040
>>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
3020
But you know, that's not the problem. When I see a core developer discussing benchmarking, when the same application using non-ASCII chars becomes 1, 2, 5, 10, 20 times, if not more, slower compared to pure ASCII, I'm wondering if there is not a serious problem somewhere. (and also becoming slower than Py3.2) BTW, very easy to explain. I do not understand why the free, open, what-you-wish-here, ... software is so often pushing to the adoption of serious corporate products. jmf
Your error reports always seem to revolve around benchmarks, despite speed not being one of Python's prime objectives. Computers store data using bytes. ASCII characters can be stored using a single byte. Unicode code points cannot be stored in a single byte, therefore Unicode will always be inherently slower than ASCII. Implementation details mean that some Unicode characters may be handled more efficiently than others; why is this wrong? Why should all Unicode operations be equally slow? -- There isn't any problem -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 11 June 2014 05:43, alister alister.nospam.w...@ntlworld.com wrote: Your error reports always seem to revolve around benchmarks, despite speed not being one of Python's prime objectives. By his own admission, jmf doesn't use Python anymore. His only reason to remain on this emailing/newsgroup is to troll about the FSR. Please don't reply to him (and preferably add him to your killfile). Tim Delaney -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 10/06/2014 20:43, alister wrote: On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote: [snip the garbage] jmf Your error reports always seem to revolve around benchmarks, despite speed not being one of Python's prime objectives. Computers store data using bytes. ASCII characters can be stored using a single byte. Unicode code points cannot be stored in a single byte, therefore Unicode will always be inherently slower than ASCII. Implementation details mean that some Unicode characters may be handled more efficiently than others; why is this wrong? Why should all Unicode operations be equally slow? I'd like to dedicate a song to jmf. From the Canterbury Sound band Caravan, the album The Battle Of Hastings, the song title Liar. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
Please don't be unnecessarily cruel and antagonistic. -- Devin On Tue, Jun 10, 2014 at 4:16 PM, Mark Lawrence breamore...@yahoo.co.uk wrote: On 10/06/2014 20:43, alister wrote: On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote: [snip the garbage] jmf Your error reports always seem to revolve around benchmarks, despite speed not being one of Python's prime objectives. Computers store data using bytes. ASCII characters can be stored using a single byte. Unicode code points cannot be stored in a single byte, therefore Unicode will always be inherently slower than ASCII. Implementation details mean that some Unicode characters may be handled more efficiently than others; why is this wrong? Why should all Unicode operations be equally slow? I'd like to dedicate a song to jmf. From the Canterbury Sound band Caravan, the album The Battle Of Hastings, the song title Liar. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Tue, 10 Jun 2014 19:43:13 +, alister wrote: On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote: Please don't feed the troll. I don't know whether JMF is trolling or if he is a crank who doesn't understand what he is doing, but either way he's been trying to square this circle for the last couple of years. He believes, or *claims* to believe, that a performance regression (one which others cannot replicate) is *mathematical proof* that Python's Unicode handling is invalid. What can one say to crack-pottery of this magnitude? Just kill-file his posts and be done. -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 06/10/2014 04:29 PM, Devin Jeanpierre wrote: Please don't be unnecessarily cruel and antagonistic. I completely agree. jmf should leave us alone and stop cruelly and antagonizingly baiting us with stupidity and falsehoods. -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 11/06/2014 00:29, Devin Jeanpierre wrote: Please don't be unnecessarily cruel and antagonistic. -- Devin I am simply giving our resident unicode expert a taste of his own medicine. If you don't like that complain to the PSF about the root cause of the problem, not the symptoms. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
Chris Angelico ros...@gmail.com writes: I don't have an actual use-case for this, as I don't target microcontrollers, but I'm curious: What parts of Py3 syntax aren't supported? I meant to say % formatting for strings but that's apparently been added recently. My previous micropython build was from February. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Jun 4, 2014, at 4:01 AM, Tim Chase python.l...@tim.thechases.com wrote: If you use UTF-8 for everything It seems to me, that increasingly other libraries (C, etc), use utf8 as the preferred string interchange format. It's universal, not prone to endian issues, etc. So one *advantage* you gain for using utf8 internally, is any time you need to hand a string to an external thing, it's just ready. An app that restricts its internal string processing to streaming-based operations but has to hand strings to external libraries a lot (e.g. cairo) might actually benefit from using utf8 internally, because a) it's not doing the linear search for the odd character address and b) it no longer needs to decode/encode every time it sends or receives a string to an external library. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
In article mailman.10822.1402073958.18130.python-l...@python.org, Travis Griggs travisgri...@gmail.com wrote: On Jun 4, 2014, at 4:01 AM, Tim Chase python.l...@tim.thechases.com wrote: If you use UTF-8 for everything It seems to me, that increasingly other libraries (C, etc), use utf8 as the preferred string interchange format. It's universal, not prone to endian issues, etc. One of the important "etc" factors is, "Since it's the most commonly used, it's the one that other people are most likely to have implemented correctly." In the real world, these are important considerations. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 2014-06-06 09:59, Travis Griggs wrote: On Jun 4, 2014, at 4:01 AM, Tim Chase wrote: If you use UTF-8 for everything It seems to me, that increasingly other libraries (C, etc), use utf8 as the preferred string interchange format. I definitely advocate UTF-8 for any streaming scenario, as you're iterating unidirectionally over the data anyways, so why use/transmit more bytes than needed. The only failing of UTF-8 that I've found in the real world(*) is when you have the requirement of constant-time indexing into strings. -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
In article f935e85f-f86a-4821-86ab-3ab7e5e21...@googlegroups.com, Rustom Mody rustompm...@gmail.com wrote: On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote: Yup. I wrote a while(*) back about the pain I was having importing some data into a MySQL(**) database Here's my interpretation of that situation; I'd like to hear yours: Basic problem was that MySQL handled a strict subset of what the rest of the system (Python 2.7?) could handle. Yes. This was not a Python issue. I was just responding to ChrisA's statement: Binding your program to BMP-only is nearly as dangerous as binding it to ASCII-only; potentially worse, because you can run an awful lot of artificial tests without remembering to stick in some astral characters. Of course switching to postgres may be a sound choice on other fronts. But if that were not an option, and you only had these choices: - significantly complexify your MySQL data structures to handle 4 in 20 million cases - just detect and throw such cases out at the outset which would you take? It turns out, we could have upgraded to a newer version of MySQL, which did handle astral characters correctly. But, what we did was discarded the records containing non-BMP data. Of course, that's a decision that can only be made when you understand the business requirements. In our case, discarding those four records had no impact on our business, so it made sense. For other people, not having the full dataset might have been a fatal problem. This was just one of many MySQL problems we ran into. Eventually, we decided it wasn't worth fighting with what was obviously a brain-dead system, and switched databases. -- https://mail.python.org/mailman/listinfo/python-list
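The "detect and throw such cases out" option discussed above can be sketched in a few lines. This is only an illustration, not what was actually run against the MySQL import: the helper name has_astral and the sample records are made up for the example. It relies on the fact that the BMP ends at U+FFFF.

```python
def has_astral(text):
    """Return True if text contains any code point outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Hypothetical sample records; the second one contains U+1F600, an astral code point.
records = ["plain ascii", "smiley \U0001F600", "tibetan \u0fce"]

# Keep only the records a BMP-only store could hold.
kept = [record for record in records if not has_astral(record)]
print(kept)
```

Whether dropping the rejected records is acceptable is, as the message says, a business decision rather than a technical one.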
Re: Micro Python -- a lean and efficient implementation of Python 3
On Thu, Jun 5, 2014 at 11:59 PM, Roy Smith r...@panix.com wrote: It turns out, we could have upgraded to a newer version of MySQL, which did handle astral characters correctly. But, what we did was discarded the records containing non-BMP data. Of course, that's a decision that can only be made when you understand the business requirements. In our case, discarding those four records had no impact on our business, so it made sense. For other people, not having the full dataset might have been a fatal problem. This was just one of many MySQL problems we ran into. Eventually, we decided it wasn't worth fighting with what was obviously a brain-dead system, and switched databases. Point to note: It's not just "Avoid MySQL version x.y.z, it's buggy", but "Make sure you're on a sufficiently new version of MySQL *and then use these settings*". For instance, the MySQL "utf8" locale/collation/charset (not sure what it calls it) supports only the BMP; you have to use "utf8mb4", which is UTF-8 that's allowed to go as far as four bytes long. What were they thinking? What, were they thinking? I understand there's now an alias "utf8mb3" for the buggy "utf8", with some theory that some future version of MySQL might make "utf8" become an alias for "utf8mb4". But when would you ever actually *demand* this buggy behaviour? Why not just say "as of this version, utf8 is identical to utf8mb4, which was a superset thereof", and if anything changes or breaks, just acknowledge that it used to be buggy? /rant Use PostgreSQL. /obvious ChrisA -- https://mail.python.org/mailman/listinfo/python-list
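To see why a UTF-8 variant capped at three bytes per character covers only the BMP, it may help to look at the encoded width of a few code points. A small sketch follows; the helper names utf8_width and fits_three_byte_utf8 are invented for illustration and are not MySQL or Python APIs.

```python
def utf8_width(ch):
    """Number of bytes a single code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text):
    """True if every code point in text encodes to at most three UTF-8 bytes."""
    return all(utf8_width(ch) <= 3 for ch in text)


# ASCII, Latin-1 range, a BMP character, and an astral character.
for ch in ("z", "\u00e9", "\u0fce", "\U0001F600"):
    print("U+%04X needs %d byte(s)" % (ord(ch), utf8_width(ch)))

print(fits_three_byte_utf8("caf\u00e9 \u0fce"))   # BMP only
print(fits_three_byte_utf8("smiley \U0001F600"))  # contains an astral character
```

Every code point up to U+FFFF fits in three bytes; anything above needs four, which is the case the four-byte setting exists for.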
Re: Micro Python -- a lean and efficient implementation of Python 3
On Jun 3, 2014 11:27 PM, Steven D'Aprano st...@pearwood.info wrote: For technical reasons which I don't fully understand, Unicode only uses 21 of those 32 bits, giving a total of 1114112 available code points. I think mainly it's to accommodate UTF-16. The surrogate pair scheme is sufficient to encode up to 16 supplementary planes, so if Unicode were allowed to grow any larger than that, UTF-16 would no longer be able to encode all codepoints. Another benefit of fixing the size is that it frees the other 11 bits per character of UTF-32 for packing in ancillary data. -- https://mail.python.org/mailman/listinfo/python-list
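The arithmetic behind the surrogate-pair limit can be checked directly. The sketch below is an illustration only; the function name to_surrogates is made up. It splits a supplementary code point into its two 10-bit halves and counts the resulting code space.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# Two 10-bit halves give 2**20 supplementary code points, i.e. 16 planes of 65536.
supplementary = 2 ** 20
total = 0x10000 + supplementary            # BMP plus 16 supplementary planes
print(supplementary // 0x10000, total)

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
```

The total matches the 1114112 figure quoted above, which is 17 planes of 65536 code points each.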
Re: Micro Python -- a lean and efficient implementation of Python 3
On 6/4/2014 1:55 AM, Ian Kelly wrote: On Jun 3, 2014 11:27 PM, Steven D'Aprano st...@pearwood.info mailto:st...@pearwood.info wrote: For technical reasons which I don't fully understand, Unicode only uses 21 of those 32 bits, giving a total of 1114112 available code points. I think mainly it's to accommodate UTF-16. The surrogate pair scheme is sufficient to encode up to 16 supplementary planes, so if Unicode were allowed to grow any larger than that, UTF-16 would no longer be able to encode all codepoints. I believe the original utf-8 used up to 6 bytes per char to encode 2**32 potential chars. Just 4 bytes limits to 2**21 and for whatever reason (easier decoding?), utf-8 was revised down (unusual ;-). -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
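The bit counts per UTF-8 sequence length are easy to work out. In a sketch like the one below (payload_bits is a name made up for this example), a multi-byte sequence of n bytes carries 7 - n payload bits in its lead byte plus 6 bits in each continuation byte. By this arithmetic the original six-byte form carries 31 bits rather than the 32 mentioned above, and four bytes carry 21.

```python
def payload_bits(n_bytes):
    """Payload bits carried by a UTF-8 sequence of n_bytes (1 through 6)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("UTF-8 sequences were defined for 1 to 6 bytes")
    if n_bytes == 1:
        return 7                               # 0xxxxxxx
    # Lead byte: n one-bits, a zero bit, then 7 - n payload bits.
    # Each continuation byte (10xxxxxx) adds 6 payload bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


for n in range(1, 7):
    print(n, "byte(s):", payload_bits(n), "bits")
```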
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, Jun 4, 2014 at 5:00 PM, Terry Reedy tjre...@udel.edu wrote: On 6/4/2014 1:55 AM, Ian Kelly wrote: On Jun 3, 2014 11:27 PM, Steven D'Aprano st...@pearwood.info mailto:st...@pearwood.info wrote: For technical reasons which I don't fully understand, Unicode only uses 21 of those 32 bits, giving a total of 1114112 available code points. I think mainly it's to accommodate UTF-16. The surrogate pair scheme is sufficient to encode up to 16 supplementary planes, so if Unicode were allowed to grow any larger than that, UTF-16 would no longer be able to encode all codepoints. I believe the original utf-8 used up to 6 bytes per char to encode 2**32 potential chars. Just 4 bytes limits to 2**21 and for whatever reason (easier decoding?), utf-8 was revised down (unusual ;-). I understood it to be UTF-16's fault, per Ian's statement. That is to say, the entire Unicode standard was warped around the problem that some people were going around thinking a character is 16 bits, even though that's just as fallacious as a character is 8 bits. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody rustompm...@gmail.com wrote: On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote: On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote: And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there] Not if you're working on the internet. There are several critical groups of characters that aren't in the BMP, such as: Of course. But what has the internet to do with micropython? Earlier you said: IOW from pov of a universally acceptable character set this is mostly rubbish "Universally acceptable character set" and microcontrollers may well not meet, but if you're talking about universality, you need Unicode. It's that simple. Maybe there's a use-case for a microcontroller that works in ISO-8859-5 natively, thus using only eight bits per character, but even if there is, I would expect a Python implementation on it to expose Unicode codepoints in its strings. (Most of the time you won't even be aware of the exact codepoint values. It's only when you put \xNN or \uNNNN or \U000NNNNN escapes into your strings, or explicitly use ord/chr or equivalent, that it'd make a difference.) The point is not that you might be able to get away with sticking your head in the sand and wishing Unicode would just go away. Even if you can, it's not something Python 3 can ever do. And I don't think anybody can, anyway. If your device is big enough to hold Python, it should be big enough to handle Unicode; and then you don't have to say "Oh, sorry rest-of-the-world, this only works in English... and only a subset of English... and stuff". ChrisA -- https://mail.python.org/mailman/listinfo/python-list
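For readers who have not met the escape forms mentioned above, here is a short sketch of how they and ord/chr expose code point values in Python 3. The sample characters are arbitrary choices for the example.

```python
# The same character written with three escape widths.
two_digit = "\x41"          # \xNN: two hex digits
four_digit = "\u0041"       # \uNNNN: four hex digits
eight_digit = "\U00000041"  # \UNNNNNNNN: eight hex digits
print(two_digit == four_digit == eight_digit == "A")

# An astral character needs the eight-digit form, yet is still one code point.
smiley = "\U0001F600"
print(len(smiley), hex(ord(smiley)))

# chr and ord convert between code point numbers and one-character strings.
print(chr(0x0FCE) == "\u0fce", ord("\u0fce") == 0x0FCE)
```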
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, Jun 4, 2014 at 3:02 PM, Ian Kelly ian.g.ke...@gmail.com wrote: On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody rustompm...@gmail.com wrote: 1) Most or all Chinese and Japanese characters Dont know how you count 'most' | One possible rationale is the desire to limit the size of the full | Unicode character set, where CJK characters as represented by discrete | ideograms may approach or exceed 100,000 (while those required for | ordinary literacy in any language are probably under 3,000). Version 1 | of Unicode was designed to fit into 16 bits and only 20,940 characters | (32%) out of the possible 65,536 were reserved for these CJK Unified | Ideographs. Later Unicode has been extended to 21 bits allowing many | more CJK characters (75,960 are assigned, with room for more). | From http://en.wikipedia.org/wiki/Han_unification So there are 20,940 CJK characters in the BMP, and approximately 55,000 more in the SIP. I'd count 55,000 out of 75,960 as most. And I said or all because I have this vague notion that either NFC or NFD pushes stuff out of the BMP, although I may be wrong on that. But certainly 55K/75K with room for more is the most that I was talking about. (Maybe it isn't most by usage. After all, hypertext documents are usually smaller in UTF-8 than in UTF-16, despite most characters (counting purely by 21-bit space in codepoints) being more compact in UTF-16; most by usage is of ASCII, because hypertext involves a lot of punctuation and such. But still, there are a lot of CJK that aren't in the BMP.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
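The parenthetical about hypertext being smaller in UTF-8 than in UTF-16 can be illustrated with a toy comparison. The markup string below is a made-up sample, not data from the thread: bare CJK text is more compact in UTF-16, but once ASCII markup surrounds it the balance tips the other way.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


bare = "\u4f60\u597d"                        # two BMP CJK characters
markup = '<p class="greeting">' + bare + "</p>"

print("bare text:", encoded_sizes(bare))
print("with markup:", encoded_sizes(markup))
```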
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, 04 Jun 2014 17:16:13 +1000, Chris Angelico wrote: On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody rustompm...@gmail.com wrote: On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote: On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote: And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there] Not if you're working on the internet. There are several critical groups of characters that aren't in the BMP, such as: Of course. But what has the internet to do with micropython? When I download a script from the Internet to run on my microcontroller, written by somebody in Greece, and it calls print on a Greek string, I should see Greek text even if I'm in Sweden or New Zealand or Japan. A fully localised application would be better, of course, but failing that I shouldn't see moji-bake. Earlier you said: IOW from pov of a universallly acceptable character set this is mostly rubbish Universally acceptable character set and microcontrollers may well not meet, but if you're talking about universality, you need Unicode. It's that simple. Maybe there's a use-case for a microcontroller that works in ISO-8859-5 natively, thus using only eight bits per character, That won't even make the Russians happy, since in Russia there are multiple incompatible legacy encodings. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
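A minimal sketch of the mojibake being described, assuming two of the legacy Russian encodings (KOI8-R and Windows-1251) as the mismatched pair. The sample word is arbitrary; the point is only that bytes written under one eight-bit encoding and read under another decode without error but come out as different text.

```python
text = "\u041f\u0440\u0438\u0432\u0435\u0442"   # a Russian greeting, six Cyrillic letters

# Written by a program that assumes KOI8-R, read by one that assumes Windows-1251.
raw = text.encode("koi8_r")
garbled = raw.decode("cp1251")
print(garbled == text)   # the bytes decode without error, but to different letters

# The same bytes round-trip correctly when both sides agree on the encoding.
print(raw.decode("koi8_r") == text)
print(text.encode("utf-8").decode("utf-8") == text)
```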
Re: Micro Python -- a lean and efficient implementation of Python 3
Steven D'Aprano st...@pearwood.info writes: Maybe there's a use-case for a microcontroller that works in ISO-8859-5 natively, thus using only eight bits per character, That won't even make the Russians happy, since in Russia there are multiple incompatible legacy encodings. I've never understood why not use UTF-8 for everything. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 04.06.2014 09:16, Chris Angelico wrote: The point is not that you might be able to get away with sticking your head in the sand and wishing Unicode would just go away. Even if you can, it's not something Python 3 can ever do. Exactly. These endless discussions about different encodings start to get really boring. I cannot think of any aspect of it that hasn't been discussed here on several occasions, but as a fact: Strings are immutable sequences of Unicode code points in Python3 (https://docs.python.org/3/library/stdtypes.html?highlight=str#textseq) and this is not an implementation detail. So if any implementation doesn't stick to this convention, it is simply incomplete. And I don't think anybody can, anyway. If your device is big enough to hold Python, it should be big enough to handle Unicode; and then you don't have to say Oh, sorry rest-of-the-world, this only works in English... and only a subset of English... and stuff. Wolfgang -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 04/06/2014 08:58, Paul Rubin wrote: Steven D'Aprano st...@pearwood.info writes: Maybe there's a use-case for a microcontroller that works in ISO-8859-5 natively, thus using only eight bits per character, That won't even make the Russians happy, since in Russia there are multiple incompatible legacy encodings. I've never understood why not use UTF-8 for everything. me too -mojibaked-ly yrs- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 2014-06-04 00:58, Paul Rubin wrote: Steven D'Aprano st...@pearwood.info writes: Maybe there's a use-case for a microcontroller that works in ISO-8859-5 natively, thus using only eight bits per character, That won't even make the Russians happy, since in Russia there are multiple incompatible legacy encodings. I've never understood why not use UTF-8 for everything. If you use UTF-8 for everything, then you end up in a world where string-indexing (see ChrisA's other side thread on this topic) is no longer an O(1) operation, but an O(N) operation. Some of us slice strings for a living. ;-) I understand that using UTF-32 would allow us to maintain O(1) indexing at the cost of every string occupying 4 bytes per character. The FSR (again, as I understand it) allows strings that fit in one-byte-per-character to use that, scaling up to use wider characters internally as they're actually needed/used. At the cost of complexity and non-constant memory space, an O(N) algorithm could be tweaked down to O(log N) by using an internal balanced tree of offsets-to-chunks (where the chunk-size was the size of a block where it was faster to scan linearly than to navigate the tree). One might even endow the algorithm with FSR smarts, so each chunk/fragment could be a different encoding in memory, and linearly iterating over the string would walk the tree, returning each decoded piece. /random_ramblings -tkc -- https://mail.python.org/mailman/listinfo/python-list
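The per-character widths described above can be observed on CPython 3.3 or later by measuring how much sys.getsizeof grows as a string gets longer. This is a sketch that depends on CPython's internal representation, so other implementations may report different numbers; bytes_per_char is a name invented for the example.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per character by comparing strings of n and 2*n copies of ch."""
    # Strings of the same kind share a fixed header, so the difference is pure payload.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for label, ch in (("ASCII", "z"), ("BMP", "\u0fce"), ("astral", "\U0001F600")):
    print(label, bytes_per_char(ch), "byte(s) per character")
```

On CPython this is expected to report 1, 2 and 4 bytes respectively, with the width for a whole string set by its widest character.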
Re: Micro Python -- a lean and efficient implementation of Python 3
On 04/06/2014 12:01, Tim Chase wrote: On 2014-06-04 00:58, Paul Rubin wrote: Steven D'Aprano st...@pearwood.info writes: Maybe there's a use-case for a microcontroller that works in ISO-8859-5 natively, thus using only eight bits per character, That won't even make the Russians happy, since in Russia there are multiple incompatible legacy encodings. I've never understood why not use UTF-8 for everything. If you use UTF-8 for everything, then you end up in a world where string-indexing (see ChrisA's other side thread on this topic) is no longer an O(1) operation, but an O(N) operation. Some of us slice strings for a living. ;-) I understand that using UTF-32 would allow us to maintain O(1) indexing at the cost of every string occupying 4 bytes per character. The FSR (again, as I understand it) allows strings that fit in one-byte-per-character to use that, scaling up to use wider characters internally as they're actually needed/used. I believe that we should distinguish between glyph/character indexing and string indexing. Even in unicode it may be hard to decide where a visual glyph starts and ends. I assume most people would like to assign one glyph to one unicode, but that's not always possible with composed glyphs.
>>> for a in (u'\xc5',u'A\u030a'):
...     for o in (u'\xf6',u'o\u0308'):
...         u=a+u'ngstr'+o+u'm'
...         print("%s %s" % (repr(u),u))
...
u'\xc5ngstr\xf6m' Ångström
u'\xc5ngstro\u0308m' Ångström
u'A\u030angstr\xf6m' Ångström
u'A\u030angstro\u0308m' Ångström
>>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
False
so even unicode doesn't always allow for O(1) glyph indexing. I know this is artificial, but this is the same situation as utf8 faces just the frequency of occurrence is different. A very large amount of computing is still western centric so searching a byte string for latin characters is still efficient; searching for an n with a tilde on top might not be so easy. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
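One common way to make the four spellings above compare equal is Unicode normalization via the standard unicodedata module. The sketch below is written for Python 3 and is only an illustration of that technique; the thread itself does not prescribe it.

```python
import unicodedata

# The four spellings from the session above: precomposed or decomposed A-ring and o-diaeresis.
forms = [a + "ngstr" + o + "m"
         for a in ("\xc5", "A\u030a")
         for o in ("\xf6", "o\u0308")]

print(len(set(forms)))                       # four distinct code point sequences

# NFC composes where possible, NFD decomposes; each maps all four to one sequence.
nfc = {unicodedata.normalize("NFC", s) for s in forms}
nfd = {unicodedata.normalize("NFD", s) for s in forms}
print(len(nfc), len(nfd))
print([len(s) for s in nfc], [len(s) for s in nfd])
```

Normalizing both sides before comparing fixes equality tests, but it does not make len() count what a reader would call characters: the NFD form is still longer than the NFC form.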
Re: Micro Python -- a lean and efficient implementation of Python 3
Tim Chase python.l...@tim.thechases.com: On 2014-06-04 00:58, Paul Rubin wrote: I've never understood why not use UTF-8 for everything. If you use UTF-8 for everything, then you end up in a world where string-indexing (see ChrisA's other side thread on this topic) is no longer an O(1) operation, but an O(N) operation. Most string operations are O(N) anyway. Besides, you could try and be smart and keep a recent index cached so simple for loops would be O(N) instead of O(N**2). So the idea of keeping strings internally in UTF-8 might not be all that bad. Marko -- https://mail.python.org/mailman/listinfo/python-list
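The cached-index idea can be sketched as a toy class. Everything here is hypothetical: Utf8Str is not a real Python or MicroPython type, and a real implementation would need slicing, negative indexes and more. It stores UTF-8 bytes, remembers the position of the last lookup, and resumes scanning from there when the next index is further along.

```python
class Utf8Str:
    """Toy string type (hypothetical) that stores UTF-8 and caches the last lookup."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._last = (0, 0)   # (code point index, byte offset) of the last lookup

    @staticmethod
    def _width(lead):
        # Length of a UTF-8 sequence, derived from its lead byte.
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes are not supported in this sketch")
        char_index, offset = self._last
        if index < char_index:
            char_index, offset = 0, 0          # going backwards: rescan from the start
        while char_index < index:
            if offset >= len(self._data):
                raise IndexError(index)
            offset += self._width(self._data[offset])
            char_index += 1
        if offset >= len(self._data):
            raise IndexError(index)
        self._last = (char_index, offset)      # remember where we got to
        width = self._width(self._data[offset])
        return self._data[offset:offset + width].decode("utf-8")


s = Utf8Str("a\u0fce\U0001F600z")
print([s[i] for i in range(4)] == list("a\u0fce\U0001F600z"))
```

A forward loop over indexes advances one character per step from the cached position, so the whole loop is O(N). Random access, as in the slicing workloads mentioned elsewhere in the thread, still pays for a scan each time.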
Re: Micro Python -- a lean and efficient implementation of Python 3
Robin Becker ro...@reportlab.com: u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m' False Now *that* would be a valid reason for our resident Unicode expert to complain! Py3 in no way solves text representation issues definitively. I know this is artificial Not at all. It probably is out of scope for Python, but it is a real cause for human suffering. What's Unicode for résumé? Note, for example, that Google manages to sort out issues like these. It sees past diacritics and even case ending. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 2014-06-04 12:53, Robin Becker wrote: If you use UTF-8 for everything, then you end up in a world where string-indexing (see ChrisA's other side thread on this topic) is no longer an O(1) operation, but an O(N) operation. Some of us slice strings for a living. ;-) I believe that we should distinguish between glyph/character indexing and string indexing. I'm only talking about string indexing using my_string[some_slice] which is traditionally O(1) and breaking that [cw]ould cause unexpected performance degradation. -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 2014-06-04 14:57, Marko Rauhamaa wrote: If you use UTF-8 for everything, then you end up in a world where string-indexing (see ChrisA's other side thread on this topic) is no longer an O(1) operation, but an O(N) operation. Most string operations are O(N) anyway. Besides, you could try and be smart and keep a recent index cached so simple for loops would be O(N) instead of O(N**2). So the idea of keeping strings internally in UTF-8 might not be all that bad. As mentioned elsewhere, I've got a LOT of code that expects that string indexing is O(1), and rarely are those strings/offsets reused. I'm streaming through customer/provider data files, so caching wouldn't do much good other than waste space and the time to maintain them. If I knew that string indexing was O(something non constant), I'd have retooled my algorithms to take that into consideration, but that would be a lot of code I'd need to touch. -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On 04/06/2014 13:17, Marko Rauhamaa wrote: . Note, for example, that Google manages to sort out issues like these. It sees past diacritics and even case ending. . I guess they must normalize all inputs to some standard form and then search / eigenvectorize on those. There are quite a few diacritics and a fair few glyphs they could be applied to. I don't think it likely they could map all possible combinations to a private range. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:

> I believe that we should distinguish between glyph/character indexing and string indexing. Even in unicode it may be hard to decide where a visual glyph starts and ends. I assume most people would like to assign one glyph to one unicode, but that's not always possible with composed glyphs.
>
> >>> for a in (u'\xc5',u'A\u030a'):
> ...     for o in (u'\xf6',u'o\u0308'):
> ...         u=a+u'ngstr'+o+u'm'
> ...         print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
>
> so even unicode doesn't always allow for O(1) glyph indexing.

What you're talking about here is graphemes, not glyphs. Glyphs are the little pictures that represent the characters when written down. Graphemes (technically, grapheme clusters) are the things which native speakers of a language believe ought to be considered a single unit. Think of them as similar to letters. That can be quite tricky to determine, and is dependent on the language you are speaking. The letters "ch" are considered two letters in English, but only a single letter in Czech and Slovak.

I believe that *grapheme-aware* text processing is *far* too complicated for a programming language to promise. If you think that len() needs to count graphemes, then what should len("ch") return, 1 or 2? Grapheme processing is a complex, complicated task best left up to powerful libraries built on top of a sturdy Unicode base.

> I know this is artificial,

But it isn't artificial in the least. Unicode isn't complicated because it's badly designed, or complicated for the sake of complexity. It's complicated because human language is complicated. That, and because of legacy encodings.

> but this is the same situation as utf8 faces just the frequency of occurrence is different. A very large amount of computing is still western centric so searching a byte string for latin characters is still efficient; searching for an n with a tilde on top might not be so easy.

This is a good point, but on balance I disagree. A grapheme-aware library is likely to need to be based on more complex data structures than simple strings (arrays of code points). But for the underlying relatively simple string library, graphemes are too hard. Code points are simple, and the language can deal with code points without caring about their semantics.

For instance, in English, I might not want to insert letters between the "q" and "u" of "queen", since in English "u" (nearly) always follows "q". It would be inappropriate for the programming language string library to care about that, and similarly it would be inappropriate for it to care that u'A\u030a' represents a single grapheme Å.

--
Steven D'Aprano http://import-that.dreamwidth.org/
-- https://mail.python.org/mailman/listinfo/python-list
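The split between a code-point string type and a library layer on top of it can be sketched with the standard unicodedata module. The helper names same_text and count_base_characters are made up for illustration, and counting non-combining code points is only a crude stand-in for real grapheme segmentation: it knows nothing about language-dependent cases such as the Czech "ch".

```python
import unicodedata

composed = "\xc5ngstr\xf6m"            # precomposed Å and ö
decomposed = "A\u030angstro\u0308m"    # base letters followed by combining marks


def same_text(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def count_base_characters(s):
    """Count code points that are not combining marks (rough approximation only)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# The string type itself only sees code points, so these two differ...
plain_equal = (composed == decomposed)
code_point_lengths = (len(composed), len(decomposed))
# ...while the library layer can decide that they are the same text.
normalised_equal = same_text(composed, decomposed)
```

Here len() keeps its simple, cheap meaning, and anything smarter is an explicit call into a library.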
Re: Micro Python -- a lean and efficient implementation of Python 3
Tim Chase python.l...@tim.thechases.com writes: As mentioned elsewhere, I've got a LOT of code that expects that string indexing is O(1), and rarely are those strings/offsets reused. I'm streaming through customer/provider data files, so caching wouldn't do much good other than waste space and the time to maintain them. I'm having trouble understanding -- if they're only used once then what's the problem? You're reading some enormous file into a string and then randomly accessing it by character offset? What size are these strings? I can think of a number of workarounds including language extensions, but mostly I'd be interested in seeing some actual benchmarks of your unmodified program under both representations. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
In article mailman.10673.1401853976.18130.python-l...@python.org, Chris Angelico ros...@gmail.com wrote: You can't ignore those. You might be able to say "Well, my program will run slower if you throw these at it", but if you're going down that route, you probably want the full FSR and the advantages it confers on ASCII and Latin-1 strings. Binding your program to BMP-only is nearly as dangerous as binding it to ASCII-only; potentially worse, because you can run an awful lot of artificial tests without remembering to stick in some astral characters. Yup. I wrote a while(*) back about the pain I was having importing some data into a MySQL(**) database which (unknown to me when I started) only handled BMP. It turns out in the entire dataset of 20-odd million records, there were exactly four that had astral characters. All of my tests worked. I didn't discover the problem until it blew up many hours into the final production import run. (*) Two years? (**) This was not the only pain point with MySQL. We eventually switched to Postgres. -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote: Chris Angelico wrote: You can't ignore those. You might be able to say "Well, my program will run slower if you throw these at it", but if you're going down that route, you probably want the full FSR and the advantages it confers on ASCII and Latin-1 strings. Binding your program to BMP-only is nearly as dangerous as binding it to ASCII-only; potentially worse, because you can run an awful lot of artificial tests without remembering to stick in some astral characters. Yup. I wrote a while(*) back about the pain I was having importing some data into a MySQL(**) database which (unknown to me when I started) only handled BMP. It turns out in the entire dataset of 20-odd million records, there were exactly four that had astral characters. All of my tests worked. I didn't discover the problem until it blew up many hours into the final production import run. (*) Two years? (**) This was not the only pain point with MySQL. We eventually switched to Postgres. Thanks Roy for bringing up that example - I was trying to recollect the details. I forgot about the MySQL angle which adds a different twist to it. Here's my interpretation of that situation; I'd like to hear yours: Basic problem was that MySQL handled a strict subset of what the rest of the system (Python 2.7?) could handle. This meant that at a late (and embarrassing) stage, exceptions were being thrown, from deep within the system. OTOH, let's say you could detect the 'error' (more correctly 'un-handle-able') at the borders of your system, say when the user enters the data on a web-form. Would you have a problem kicking out those characters (in both senses!) with a curt: "Can't deal with all this supra-galactic rubble!"? Of course switching to postgres may be a sound choice on other fronts. 
But if that were not an option, and you only had these choices: - significantly complexify your MySQL data structures to handle 4 in 20 million cases - just detect and throw such cases out at the outset which would you take? In any case this is the choice I hear from the micropython folks who are explicitly seeking a cutdown version of python -- https://mail.python.org/mailman/listinfo/python-list
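The "detect at the border" option can be sketched in a few lines. This assumes the backing store is strictly BMP-only, as in Roy's account; the names find_astral and require_bmp are hypothetical and not taken from any library.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def find_astral(text):
    """Return (index, character) pairs for every code point above the BMP."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > BMP_MAX]


def require_bmp(text):
    """Hypothetical border check: reject input a BMP-only store cannot hold."""
    offenders = find_astral(text)
    if offenders:
        index, ch = offenders[0]
        raise ValueError(
            "code point U+%04X at index %d is outside the BMP" % (ord(ch), index)
        )
    return text


clean = find_astral("plain text")
flagged = find_astral("caf\xe9 \U0001F601")   # é is fine, the emoticon is astral
```

Run over the whole dataset before the import starts, a check like this would have surfaced the four offending records up front instead of hours into the production run. Whether rejecting such input is acceptable is the policy question raised above; the code only makes the choice explicit.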
Micro Python -- a lean and efficient implementation of Python 3
Hi,

We would like to announce Micro Python, an implementation of Python 3 optimised to have a low memory footprint.

While Python has many attractive features, current implementations (read CPython) are not suited for embedded devices, such as microcontrollers and small systems-on-a-chip. This is because CPython uses an awful lot of RAM -- both stack and heap -- even for simple things such as integer addition.

Micro Python is a new implementation of the Python 3 language, which aims to be properly compatible with CPython, while sporting a very minimal RAM footprint, a compact compiler, and a fast and efficient runtime. These goals have been met by employing many tricks with pointers and bit stuffing, and placing as much as possible in read-only memory.

Micro Python has the following features:

- Supports almost full Python 3 syntax, including yield (compiles 99.99% of the Python 3 standard library).
- Most scripts use significantly less RAM in Micro Python, and various benchmark programs run faster, compared with CPython.
- A minimal ARM build fits in 80k of program space, and with all features enabled it fits in around 200k on Linux.
- Micro Python needs only 2k RAM for a basic REPL.
- It has 2 modes of AOT (ahead of time) compilation to native machine code, doubling execution speed.
- There is an inline assembler for use in time-critical microcontroller applications.
- It is written in C99 ANSI C and compiles cleanly under Unix (POSIX), Mac OS X, Windows and certain ARM based microcontrollers.
- It supports a growing subset of Python 3 types and operations.
- Part of the Python 3 standard library has already been ported to Micro Python, and work is ongoing to port as much as feasible.

More info at: http://micropython.org/

You can follow the progress and contribute at github:
www.github.com/micropython/micropython
www.github.com/micropython/micropython-lib

--
Damien / Micro Python team.
-- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Tue, Jun 3, 2014 at 10:27 PM, Damien George damien.p.geo...@gmail.com wrote: - Supports almost full Python 3 syntax, including yield (compiles 99.99% of the Python 3 standard library). - It supports a growing subset of Python 3 types and operations. - Part of the Python 3 standard library has already been ported to Micro Python, and work is ongoing to port as much as feasible. I don't have an actual use-case for this, as I don't target microcontrollers, but I'm curious: What parts of Py3 syntax aren't supported? And since you say port as much as feasible, presumably there'll be parts that are never supported. Are there some syntactic elements that just take up way too much memory? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Tue, 03 Jun 2014 13:27:11 +0100, Damien George wrote: Hi, We would like to announce Micro Python, an implementation of Python 3 optimised to have a low memory footprint. Fantastic! -- Steven D'Aprano http://import-that.dreamwidth.org/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
Hello,

On Tue, 3 Jun 2014 23:11:46 +1000 Chris Angelico ros...@gmail.com wrote:

> On Tue, Jun 3, 2014 at 10:27 PM, Damien George damien.p.geo...@gmail.com wrote:
> > - Supports almost full Python 3 syntax, including yield (compiles 99.99% of the Python 3 standard library).
> > - It supports a growing subset of Python 3 types and operations.
> > - Part of the Python 3 standard library has already been ported to Micro Python, and work is ongoing to port as much as feasible.
>
> I don't have an actual use-case for this, as I don't target microcontrollers,

Please let me chime in, as one of the MicroPython contributors. I also don't have an immediate use case for a Python microcontroller (but seeing how fast the industry moves, I won't be surprised if in half a year it seems just right). Instead, I treat MicroPython as a Python implementation which scales *down* very well. With the current situation in the industry, people mostly care about scaling up - consume more gigabytes and gigahertz, catch more clouds and include heavier and heavier batteries. MicroPython goes in another direction. You don't have to use it on a microcontroller. It's just that if you want/need it, you'll be able to - while still staying with your favorite language.

I'm personally interested in using MicroPython on small embedded Linux systems, like home routers, Internet-of-Things devices, etc. Such devices usually have just a few hundred megahertz of CPU power, and 2-4MB of flash. And to cut cost, the lower bound decreases all the time.

> but I'm curious: What parts of Py3 syntax aren't supported? And since you say port as much as feasible, presumably there'll be parts that are never supported. Are there some syntactic elements that just take up way too much memory?

Syntax-wise, all Python 3.3 syntax is supported. This includes things like yield from, annotations, etc. For example:

    $ micropython
    Micro Python v1.0.1-139-g411732e on 2014-06-03; UNIX version
    >>> def foo(a:int) -> float:
    ...     return float(a)
    ...
    >>> foo(4)
    4.0

The 99.99% statement is due to the fact that there were some problems parsing a couple of files in the CPython 3.3/3.4 stdlib.

Note that the above talks about syntax, not semantics. Though core language semantics is actually now implemented pretty well. For example, yield from works pretty well, so asyncio could work ;-). (Except my analysis showed that CPython's implementation is a bit bloated for MicroPython requirements, so I started to write a simplified implementation from scratch).

As can be seen from the dump above, MicroPython perfectly works on a Linux system, so we encourage any pythonista to touch a little bit of Python magic and give it a try! ;-) And we are of course interested to get feedback on how portable it is, etc. (As a side note, it's of course possible to compile and run MicroPython on Windows too, it's a bit more complicated than just "make".)

> ChrisA

--
Best regards, Paul mailto:pmis...@gmail.com
-- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, Jun 4, 2014 at 2:49 AM, Paul Sokolovsky pmis...@gmail.com wrote:

> As can be seen from the dump above, MicroPython perfectly works on a Linux system, so we encourage any pythonista to touch a little bit of Python magic and give it a try! ;-) And we are of course interested to get feedback on how portable it is, etc.

With that encouragement, I just cloned your repo and built it on amd64 Debian Wheezy. Works just fine! Except... I've just found one fairly major problem with your support of Python 3.x syntax. Your str type is documented as not supporting Unicode. Is that a current flaw that you're planning to remove, or a design limitation? Either way, I'm a bit dubious about a purported version 1 that doesn't do one of the things that Py3 is especially good at - matched by very few languages in its encouragement of best practice with Unicode support.

What is your str type actually able to support? It seems to store non-ASCII bytes in it, which I presume are supposed to represent the rest of Latin-1, but I wasn't able to print them out:

    Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
    >>> print("asdf\xfdqwer")

    Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09) [GCC 4.7.2] on linux
    >>> print("asdf\xfdqwer")
    asdfýqwer

In fact, printing seems to work with bytes:

    >>> print("asdf\xc3\xbdqwer")
    asdfýqwer

(my terminal uses UTF-8, this is the UTF-8 encoding of the above string)

I would strongly recommend either implementing all of PEP 393, or at least making it very clear that this pretends everything is bytes - and possibly disallowing any codepoint >127 in any string, which will at least mean you're safe on all ASCII-compatible encodings.

ChrisA
-- https://mail.python.org/mailman/listinfo/python-list
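What the two print calls put on the wire can be checked in CPython 3. This is only a sketch of the byte-level difference and makes no claim about MicroPython's internals: the same nine-character string is nine bytes in Latin-1 and ten in UTF-8, and the Latin-1 form is not valid UTF-8, which would explain a UTF-8 terminal failing to show it.

```python
text = "asdf\xfdqwer"                 # U+00FD is ý

as_latin1 = text.encode("latin-1")    # one byte per character
as_utf8 = text.encode("utf-8")        # ý takes two bytes, the rest one each


def decodes_as_utf8(data):
    """Report whether `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1_is_valid_utf8 = decodes_as_utf8(as_latin1)   # the lone 0xFD byte is rejected
utf8_is_valid_utf8 = decodes_as_utf8(as_utf8)
```

A str type that is really bytes passes either sequence through untouched, so what appears on screen depends entirely on which encoding the terminal expects.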
Re: Micro Python -- a lean and efficient implementation of Python 3
Hello, On Wed, 4 Jun 2014 03:08:57 +1000 Chris Angelico ros...@gmail.com wrote: [] With that encouragement, I just cloned your repo and built it on amd64 Debian Wheezy. Works just fine! Except... I've just found one fairly major problem with your support of Python 3.x syntax. Your str type is documented as not supporting Unicode. Is that a current flaw that you're planning to remove, or a design limitation? Either way, I'm a bit dubious about a purported version 1 that doesn't do one of the things that Py3 is especially good at - matched by very few languages in its encouragement of best practice with Unicode support. I should start with saying that it's MicroPython what made me look at Python3. So for me, it already did lot of boon by getting me from under the rock, so now instead of at my job, we use python 2.x I may report at my job, we don't wait when our distro will kick us in the ass, and add 'from __future__ import print_function' whenever we touch some code. With that in mind, I, as many others, think that forcing Unicode bloat upon people by default is the most controversial feature of Python3. The reason is that you go very long way dealing with languages of the people of the world by just treating strings as consisting of 8-bit data. I'd say, that's enough for 90% of applications. Unicode is needed only if one needs to deal with multiple languages *at the same time*, which is fairly rare (remaining 10% of apps). And please keep in mind that MicroPython was originally intended (and should be remain scalable down to) an MCU. Unicode needed there is even less, and even less resources to support Unicode just because. What is your str type actually able to support? 
It seems to store non-ASCII bytes in it, which I presume are supposed to represent the rest of Latin-1, but I wasn't able to print them out: There's a work-in-progress on documenting differences between CPython and MicroPython at https://github.com/micropython/micropython/wiki/Differences, it gives the following account on this: No unicode support is actually implemented. Python3 calls for strict difference between str and bytes data types (unlike Python2, which has neutral unified data type for strings and binary data, and separates out unicode data type). MicroPython faithfully implements str/bytes separation, but currently, underlying str implementation is the same as bytes. This means strings in MicroPython are not unicode, but 8-bit characters (fully binary-clean). Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version >>> print("asdf\xfdqwer") Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09) [GCC 4.7.2] on linux >>> print("asdf\xfdqwer") asdfýqwer In fact, printing seems to work with bytes: >>> print("asdf\xc3\xbdqwer") asdfýqwer (my terminal uses UTF-8, this is the UTF-8 encoding of the above string) I would strongly recommend either implementing all of PEP 393, or at least making it very clear that this pretends everything is bytes - and possibly disallowing any codepoint >127 in any string, which will at least mean you're safe on all ASCII-compatible encodings. MicroPython is not the first tiny Python implementation. What sets MicroPython apart is that it is neither its aim nor its motto to be a subset of the language. And yet, it's not a CPython rewrite either. So, while Unicode support is surely possible, it's unlikely to be done as all of PEPxxx. If you ask me, I'd personally envision it to be implemented as UTF-8 (in this regard I agree with (or take an influence from) http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/). But I don't have plans to work on Unicode any time soon - applications I envision for MicroPython so far fit in those 90% that live happily without Unicode. 
But generally, there's no strict roadmap for MicroPython features. While core of the language (parser, compiler, VM) is developed by Damien, many other features were already contributed by the community (project went open-source at the beginning of the year). So, if someone will want to see Unicode support up to the level of providing patches, it gladly will be accepted. The only thing we established is that we want to be able to scale down, and thus almost all features should be configurable. ChrisA -- https://mail.python.org/mailman/listinfo/python-list -- Best regards, Paul mailto:pmis...@gmail.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, Jun 4, 2014 at 7:41 AM, Paul Sokolovsky pmis...@gmail.com wrote: Hello, On Wed, 4 Jun 2014 03:08:57 +1000 Chris Angelico ros...@gmail.com wrote: [] With that encouragement, I just cloned your repo and built it on amd64 Debian Wheezy. Works just fine! Except... I've just found one fairly major problem with your support of Python 3.x syntax. Your str type is documented as not supporting Unicode. Is that a current flaw that you're planning to remove, or a design limitation? Either way, I'm a bit dubious about a purported version 1 that doesn't do one of the things that Py3 is especially good at - matched by very few languages in its encouragement of best practice with Unicode support. I should start with saying that it's MicroPython what made me look at Python3. So for me, it already did lot of boon by getting me from under the rock, so now instead of at my job, we use python 2.x I may report at my job, we don't wait when our distro will kick us in the ass, and add 'from __future__ import print_function' whenever we touch some code. And that's a good thing :) Using Python 2.7 and starting to put in the future directives breaks nothing, and will save you time later. With that in mind, I, as many others, think that forcing Unicode bloat upon people by default is the most controversial feature of Python3. The reason is that you go very long way dealing with languages of the people of the world by just treating strings as consisting of 8-bit data. I'd say, that's enough for 90% of applications. Unicode is needed only if one needs to deal with multiple languages *at the same time*, which is fairly rare (remaining 10% of apps). Absolutely not. This is the mentality that results in web applications that break on funny characters, which is completely the wrong way to look at it. 
The truth is, there are not many funny characters in Unicode at all; I found these, but that's about it: http://www.fileformat.info/info/unicode/char/1F601/index.htm http://www.fileformat.info/info/unicode/char/1F638/index.htm Your code should accept any valid character with equal correctness. (Note to jmf: Correctness does not necessarily imply exact nanosecond performance, just that the right result is reached.) These days, Unicode *is* needed everywhere. You might think you can get away with 8-bit data, but is that 8-bit data actually encoded Latin-1 or UTF-8? There's a vast difference between them, and you'll hit it in any English text with U+00A9 ©, or U+201C U+201D quotes, or any of a large number of other common non-ASCII characters. Oh, and the three I just mentioned happen to be in CP-1252, another common 8-bit encoding, and a lot of people and programs don't know how to tell CP-1252 from Latin-1 and label one as the other. Unicode is needed on anything that touches the internet, which is a *lot* more than 10% of applications. Unicode is also needed on anything that shares files with anyone who speaks more than one language, or uses any symbol that isn't in ASCII, or pretty much anything beyond plain English with a restricted set of punctuation. And even if you can guarantee that you're working only with English and only with ASCII, you still need to be aware that ASCII text is different stuff from a JPEG file, although it's possible to bury your head in the sand over that one. But generally, there's no strict roadmap for MicroPython features. While core of the language (parser, compiler, VM) is developed by Damien, many other features were already contributed by the community (project went open-source at the beginning of the year). So, if someone will want to see Unicode support up to the level of providing patches, it gladly will be accepted. 
The only thing we established is that we want to be able to scale down, and thus almost all features should be configurable. And that's exactly what's happening right now. https://github.com/micropython/micropython/issues/657 https://github.com/Rosuav/micropython ChrisA -- https://mail.python.org/mailman/listinfo/python-list
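The Latin-1 / CP-1252 / UTF-8 confusion described above is easy to reproduce with the codecs that ship with CPython. The helper names misread and fits_latin1 are made up for illustration.

```python
def misread(text, written_as, read_as):
    """Encode `text` with one codec and decode the resulting bytes with another."""
    return text.encode(written_as).decode(read_as)


def fits_latin1(text):
    """Report whether every character of `text` exists in Latin-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


COPYRIGHT = "\u00a9"       # U+00A9 copyright sign
QUOTES = "\u201c\u201d"    # U+201C and U+201D curly quotes

# UTF-8 bytes read as Latin-1: one character silently becomes two.
utf8_as_latin1 = misread(COPYRIGHT, "utf-8", "latin-1")

# CP-1252 bytes labelled as Latin-1: the quotes turn into C1 control characters.
cp1252_as_latin1 = misread(QUOTES, "cp1252", "latin-1")
```

Neither misreading raises an error, which is why "it's just 8-bit data" tends to hold up until the first curly quote or copyright sign arrives.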
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote: With that in mind, I, as many others, think that forcing Unicode bloat upon people by default is the most controversial feature of Python3. The reason is that you go very long way dealing with languages of the people of the world by just treating strings as consisting of 8-bit data. I'd say, that's enough for 90% of applications. Unicode is needed only if one needs to deal with multiple languages *at the same time*, which is fairly rare (remaining 10% of apps). And please keep in mind that MicroPython was originally intended (and should be remain scalable down to) an MCU. Unicode needed there is even less, and even less resources to support Unicode just because. At some time (when jmf was making more intelligible noises) I had suggested that the choice between 1/2/4 byte strings that happens at runtime in python3's FSR can be made at python-start time with a command-line switch. There are many combinations here; here is one in more detail: Instead of having one (FSR) string engine, you have (upto) 4 - a pure 1 byte (ASCII) - a pure 2 byte (BMP) with decode-failures for out-of-ranges - a pure 4 byte -- everything UTF-32 - FSR dynamic switching at runtime (with massive moping from the world's jmfs) The point is that only one of these engines would be brought into memory based on command-line/config options. Some more personal thoughts (that may be quite ill-informed!): 1. I regard myself as a unicode ignoramus+enthusiast. The world will be a better place if unicode is more pervasive. See http://blog.languager.org/2014/04/unicoded-python.html As it happens I am also a computer scientist -- I understand that in contexts where anything other than 8-bit chars is unacceptably inefficient, unicode-bloat may be a real thing. 2. 
My casual/cursory reading of the contents of the SMP-planes suggests that the stuff there are things like - egyptian hieroglyphics - mahjong characters - ancient greek musical symbols - alchemical symbols etc etc. IOW from pov of a universally acceptable character set this is mostly rubbish And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there] -- https://mail.python.org/mailman/listinfo/python-list
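The 1/2/4-byte choice discussed here can be sketched as a function of the widest code point in a string. As I understand PEP 393, this is the decision CPython's FSR makes per string when it is created; the proposal above would instead fix one width for the whole interpreter at start-up. The name storage_width is made up for illustration.

```python
def storage_width(text):
    """Bytes per character for the narrowest fixed-width form that can hold `text`."""
    widest = max(map(ord, text), default=0)   # the empty string fits in the narrowest form
    if widest <= 0xFF:        # ASCII and Latin-1
        return 1
    if widest <= 0xFFFF:      # rest of the BMP
        return 2
    return 4                  # astral planes


widths = [storage_width(s) for s in ("abc", "caf\xe9", "\u20ac5", "\U0001F601")]
```

With a start-up switch set to the 2-byte engine, the last string would have to be rejected at decode time, which is the "decode-failures for out-of-ranges" case in the list above.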
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody rustompm...@gmail.com wrote: 2. My casual/cursory reading of the contents of the SMP-planes suggests that the stuff there are things like - egyptian hieroglyphics - mahjong characters - ancient greek musical symbols - alchemical symbols etc etc. IOW from pov of a universally acceptable character set this is mostly rubbish And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there] Not if you're working on the internet. There are several critical groups of characters that aren't in the BMP, such as: 1) Most or all Chinese and Japanese characters 2) Heaps of emoticons and fancy letters 3) Mathematical symbols You can't ignore those. You might be able to say "Well, my program will run slower if you throw these at it", but if you're going down that route, you probably want the full FSR and the advantages it confers on ASCII and Latin-1 strings. Binding your program to BMP-only is nearly as dangerous as binding it to ASCII-only; potentially worse, because you can run an awful lot of artificial tests without remembering to stick in some astral characters. It's not rubbish. It's important stuff that you need to deal with. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote: On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote: And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there] Not if you're working on the internet. There are several critical groups of characters that aren't in the BMP, such as: Of course. But what has the internet to do with micropython? This is their stated goal: | Micro Python is a lean and fast implementation of the Python | programming language (python.org) that is optimised to run on a | microcontroller. 1) Most or all Chinese and Japanese characters Dont know how you count 'most' | One possible rationale is the desire to limit the size of the full | Unicode character set, where CJK characters as represented by discrete | ideograms may approach or exceed 100,000 (while those required for | ordinary literacy in any language are probably under 3,000). Version 1 | of Unicode was designed to fit into 16 bits and only 20,940 characters | (32%) out of the possible 65,536 were reserved for these CJK Unified | Ideographs. Later Unicode has been extended to 21 bits allowing many | more CJK characters (75,960 are assigned, with room for more). | From http://en.wikipedia.org/wiki/Han_unification -- https://mail.python.org/mailman/listinfo/python-list
Re: Micro Python -- a lean and efficient implementation of Python 3
On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody rustompm...@gmail.com wrote: 1) Most or all Chinese and Japanese characters Dont know how you count 'most' | One possible rationale is the desire to limit the size of the full | Unicode character set, where CJK characters as represented by discrete | ideograms may approach or exceed 100,000 (while those required for | ordinary literacy in any language are probably under 3,000). Version 1 | of Unicode was designed to fit into 16 bits and only 20,940 characters | (32%) out of the possible 65,536 were reserved for these CJK Unified | Ideographs. Later Unicode has been extended to 21 bits allowing many | more CJK characters (75,960 are assigned, with room for more). | From http://en.wikipedia.org/wiki/Han_unification So there are 20,940 CJK characters in the BMP, and approximately 55,000 more in the SIP. I'd count 55,000 out of 75,960 as most. -- https://mail.python.org/mailman/listinfo/python-list
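The arithmetic behind "most" can be checked directly from the two figures in the quoted Wikipedia passage. These are counts of assigned ideographs, not a measure of how often the characters occur in real text.

```python
total_cjk = 75960          # CJK ideographs assigned, per the quoted passage
bmp_cjk = 20940            # of those, the ones reserved in the 16-bit BMP
bmp_size = 65536           # code points in the BMP

outside_bmp = total_cjk - bmp_cjk             # ideographs that need more than 16 bits
share_outside = outside_bmp / total_cjk       # fraction of assigned ideographs outside the BMP
share_of_bmp = bmp_cjk / bmp_size             # the "(32%)" in the quoted passage
```

That gives 55,020 ideographs outside the BMP, a little over 72% of those assigned, which matches the "approximately 55,000" above.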
Re: Micro Python -- a lean and efficient implementation of Python 3
On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote: On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote: With that in mind, I, as many others, think that forcing Unicode bloat upon people by default is the most controversial feature of Python3. The reason is that you go very long way dealing with languages of the people of the world by just treating strings as consisting of 8-bit data. I'd say, that's enough for 90% of applications. Unicode is needed only if one needs to deal with multiple languages *at the same time*, which is fairly rare (remaining 10% of apps). And please keep in mind that MicroPython was originally intended (and should be remain scalable down to) an MCU. Unicode needed there is even less, and even less resources to support Unicode just because. At some time (when jmf was making more intelligible noises) I had suggested that the choice between 1/2/4 byte strings that happens at runtime in python3's FSR can be made at python-start time with a command-line switch. There are many combinations here; here is one in more detail: Instead of having one (FSR) string engine, you have (upto) 4 - a pure 1 byte (ASCII) There are only 128 ASCII characters, so a pure ASCII implementation cannot even represent arbitrary bytes. - a pure 2 byte (BMP) with decode-failures for out-of-ranges That's not Unicode. It's a subset of Unicode. - a pure 4 byte -- everything UTF-32 For embedded devices, that would be extremely memory hungry. Remember, every variable, every attribute name, every method and class and function name is a string. Using at least 56 bytes just to refer to sys.stdout.write will be painful. - FSR dynamic switching at runtime (with massive moping from the world's jmfs) Please stop giving JMF's crackpot opinion even the dignity of being sneered at. [...] 2. 
My casual/cursory reading of the contents of the SMP-planes suggests that the stuff there are things like - egyptian hieroglyphics - mahjong characters - ancient greek musical symbols - alchemical symbols etc etc. IOW from pov of a universally acceptable character set this is mostly rubbish Certainly some of these things are more whimsical than practical, but it doesn't really matter. Even if you strip out every bit of whimsy from the Unicode character set, you're still left with needing more than 65536 characters (16 bits). For efficiency you aren't going to use 17 bits, or 18, or 19, so it's actually faster and more efficient to jump right to 32 bits. For technical reasons which I don't fully understand, Unicode only uses 21 of those 32 bits, giving a total of 1114112 available code points. Whether you or I personally have need for alchemical symbols, *some people* do, and supporting their use-case doesn't harm us by one bit. And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there] At the cost of one extra bit, strings could use UTF-16 internally and still have correct behaviour. The bit could be a flag recording whether the string contains any surrogate pairs. If the flag was 0, all string operations could assume a constant 2-bytes-per-character. If the flag was 1, it could fall back to walking the string checking for surrogate pairs. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
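The flag idea can be sketched as follows. This is an illustration of the proposal only, not any implementation's actual string type, and the class name Utf16Text is made up. For reference, the 1114112 figure is 17 planes of 65536 code points each, which is the range UTF-16 surrogate pairs can reach.

```python
class Utf16Text:
    """Sketch: UTF-16 storage plus one flag saying whether surrogate pairs occur."""

    def __init__(self, text):
        self.units = text.encode("utf-16-le")                        # 2 or 4 bytes per character
        self.has_surrogates = any(ord(ch) > 0xFFFF for ch in text)   # the "one extra bit"

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes are not handled in this sketch")
        if not self.has_surrogates:
            # Fast path: every character is exactly two bytes, so indexing is O(1).
            chunk = self.units[2 * index:2 * index + 2]
            if len(chunk) < 2:
                raise IndexError("string index out of range")
            return chunk.decode("utf-16-le")
        # Slow path: walk the code units, stepping over surrogate pairs.
        pos = 0
        remaining = index
        while pos < len(self.units):
            unit = int.from_bytes(self.units[pos:pos + 2], "little")
            width = 4 if 0xD800 <= unit <= 0xDBFF else 2    # a high surrogate starts a pair
            if remaining == 0:
                return self.units[pos:pos + width].decode("utf-16-le")
            remaining -= 1
            pos += width
        raise IndexError("string index out of range")


bmp_only = Utf16Text("abc\u20ac")
with_astral = Utf16Text("a\U0001F601b")
total_code_points = 0x10FFFF + 1      # 17 planes * 65536
```

Strings without astral characters keep constant-time indexing, and only strings that actually contain a surrogate pair pay for the walk.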
Re: Micro Python -- a lean and efficient implementation of Python 3
On Wednesday, June 4, 2014 10:50:21 AM UTC+5:30, Steven D'Aprano wrote:

On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:

And so a pure BMP-supporting implementation may be a reasonable compromise. [As long as no surrogate-pairs are there]

At the cost of one extra bit, strings could use UTF-16 internally and still have correct behaviour. The bit could be a flag recording whether the string contains any surrogate pairs. If the flag was 0, all string operations could assume a constant 2-bytes-per-character. If the flag was 1, it could fall back to walking the string checking for surrogate pairs.

Yes. That could be one possibility. My main reason in giving the 4-engine choice was not that 4 engines are a good idea, but that in the very differently constrained world of μ-controllers, playing around with alternate binding times may be advantageous.

On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:

With that in mind, I, as many others, think that forcing Unicode bloat upon people by default is the most controversial feature of Python3. The reason is that you go a very long way dealing with languages of the people of the world by just treating strings as consisting of 8-bit data. I'd say, that's enough for 90% of applications. Unicode is needed only if one needs to deal with multiple languages *at the same time*, which is fairly rare (remaining 10% of apps). And please keep in mind that MicroPython was originally intended for (and should remain scalable down to) an MCU. Unicode is needed there even less, and there are even fewer resources to support Unicode just because.

At some time (when jmf was making more intelligible noises) I had suggested that the choice between 1/2/4 byte strings that happens at runtime in python3's FSR can be made at python-start time with a command-line switch. There are many combinations here; here is one in more detail: Instead of having one (FSR) string engine, you have (up to) 4

- a pure 1 byte (ASCII)

There are only 128 ASCII characters, so a pure ASCII implementation cannot even represent arbitrary bytes.

Yes, this is a subtle point. I was initially going to write Latin-1. Wrote a rough-n-ready ASCII. But maybe it could be a choice.

I really don't understand the binding-times of μ-controllers. My impression is that actual development is split between (1) tinkering with the board and (2) working on full-powered computers and downloading to the board. In going from 2 to 1, heavy cut-downs are probably possible and desirable. If this is the case, having hooks in the system for making optimal choices may be worthwhile.

-- https://mail.python.org/mailman/listinfo/python-list
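The ASCII versus Latin-1 point raised in this exchange can be checked with a few lines of standard Python. This is a small sketch using only the built-in codecs; the variable names are illustrative.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point N, so any byte string round-trips losslessly.
as_text = all_bytes.decode("latin-1")
latin1_round_trips = as_text.encode("latin-1") == all_bytes

# ASCII only defines 0 through 127, so bytes 128 and above cannot be decoded.
try:
    all_bytes.decode("ascii")
    ascii_accepts_all_bytes = True
except UnicodeDecodeError:
    ascii_accepts_all_bytes = False
```

Under these assumptions a 1-byte engine based on Latin-1 could hold arbitrary bytes, whereas a strictly ASCII one would reject half of the possible values.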