Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-11 Thread alister
On Wed, 11 Jun 2014 08:29:06 +1000, Tim Delaney wrote:

 On 11 June 2014 05:43, alister alister.nospam.w...@ntlworld.com wrote:
 
 
 Your error reports always seem to revolve around benchmarks, despite
 speed not being one of Python's prime objectives.


 By his own admission, jmf doesn't use Python anymore. His only reason to
 remain on this mailing list/newsgroup is to troll about the FSR. Please
 don't reply to him (and preferably add him to your killfile).
 

I couldn't kill-file JMF; I find his posts useful.
Every time I find myself agreeing with him I know I have got it wrong.



-- 
The nice thing about Windows is - It does not just crash, it displays a
dialog box and lets you press 'OK' first.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-11 Thread Ben Finney
alister alister.nospam.w...@ntlworld.com writes:

 On Wed, 11 Jun 2014 08:29:06 +1000, Tim Delaney wrote:
  By his own admission, jmf doesn't use Python anymore. His only
  reason to remain on this mailing list/newsgroup is to troll about the
  FSR. Please don't reply to him (and preferably add him to your
  killfile).

 I couldn't kill-file JMF; I find his posts useful

That's fine, kill-filing his posts is a matter that affects only you.

But please do not reply to them, nor taunt him in unrelated posts; it
disrupts this forum.
Instead, give him no reason to think anyone is interested.

-- 
 \ “Too many pieces of music finish too long after the end.” —Igor |
  `\   Stravinsky |
_o__)  |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-11 Thread Michael Torrie
On 06/10/2014 01:43 PM, alister wrote:
 On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:
 BTW, very easy to explain.

Yeah he keeps saying that, but he never does explain--just flails around
and mumbles unicode.org.  Guess everyone has to have his or her
windmill to tilt at.


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread alister
On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:

 Le samedi 7 juin 2014 04:20:22 UTC+2, Tim Chase a écrit :
 On 2014-06-06 09:59, Travis Griggs wrote:
 
  On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
 
   If you use UTF-8 for everything
  
  It seems to me, that increasingly other libraries (C, etc), use
  utf8 as the preferred string interchange format.
  
  I definitely advocate UTF-8 for any streaming scenario, as you're
  iterating unidirectionally over the data anyways, so why use/transmit
  more bytes than needed.  The only failing of UTF-8 that I've found in
  the real world(*) is when you have the requirement of constant-time
  indexing into strings.
  
  -tkc
 
 And once again, just an illustration,
 
 >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
 [0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
 >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
 [2.5541921791045183, 2.52434366066052, 2.5337417948967413]
 >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
 [0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
 >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
 [0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
 
 >>> sys.getsizeof('abc'*1000 + '\u0fce')
 6040
 >>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
 3020


 
 But you know, that's not the problem.
 
 When I see a core developer discussing benchmarking,
 when the same application using non-ascii chars becomes 1, 2, 5, 10, 20
 times or more slower compared to pure ascii, I'm wondering if there is
 not a serious problem somewhere.
 
 (and also becoming slower than Py3.2)
 
 BTW, very easy to explain.
 
 I do not understand why the free, open, what-you-wish-here, ...
 software is so often pushing people toward the adoption of serious
 corporate products.
 
 jmf

Your error reports always seem to revolve around benchmarks, despite speed 
not being one of Python's prime objectives.

Computers store data using bytes.
ASCII characters can be stored using a single byte;
Unicode code-points cannot be stored in a single byte,
therefore Unicode will always be inherently slower than ASCII.

Implementation details mean that some Unicode characters may be handled 
more efficiently than others. Why is this wrong?
Why should all Unicode operations be equally slow?
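What alister describes is exactly how CPython 3.3+ behaves under PEP 393, the Flexible String Representation: each string is stored with the narrowest fixed width (1, 2, or 4 bytes per code point) that fits its widest character. A minimal sketch (sizes are CPython-specific and include per-object overhead):

```python
import sys

# CPython 3.3+ (PEP 393) stores a string with the narrowest fixed
# width that fits its widest code point: 1, 2, or 4 bytes per char.
ascii_s = 'a' * 1000           # all code points < 256  -> 1 byte each
bmp_s   = '\u0fce' * 1000      # fits in 16 bits        -> 2 bytes each
astral  = '\U0001F600' * 1000  # above U+FFFF           -> 4 bytes each

for s in (ascii_s, bmp_s, astral):
    print(len(s), sys.getsizeof(s))
```

The three strings have the same length but roughly 1x, 2x and 4x the storage, which is why the same operation can be faster on one than on another.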



-- 
There isn't any problem
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Tim Delaney
On 11 June 2014 05:43, alister alister.nospam.w...@ntlworld.com wrote:


 Your error reports always seem to revolve around benchmarks, despite speed
 not being one of Python's prime objectives.


By his own admission, jmf doesn't use Python anymore. His only reason to
remain on this mailing list/newsgroup is to troll about the FSR. Please don't
reply to him (and preferably add him to your killfile).

Tim Delaney
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Mark Lawrence

On 10/06/2014 20:43, alister wrote:

On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:



[snip the garbage]



jmf


Your error reports always seem to revolve around benchmarks, despite speed
not being one of Python's prime objectives.

Computers store data using bytes.
ASCII characters can be stored using a single byte;
Unicode code-points cannot be stored in a single byte,
therefore Unicode will always be inherently slower than ASCII.

Implementation details mean that some Unicode characters may be handled
more efficiently than others. Why is this wrong?
Why should all Unicode operations be equally slow?



I'd like to dedicate a song to jmf.  From the Canterbury Sound band 
Caravan, the album "The Battle Of Hastings", the song title "Liar".


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection 
is active.
http://www.avast.com


--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Devin Jeanpierre
Please don't be unnecessarily cruel and antagonistic.

-- Devin

On Tue, Jun 10, 2014 at 4:16 PM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 On 10/06/2014 20:43, alister wrote:

 On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:


 [snip the garbage]



 jmf


 Your error reports always seem to revolve around benchmarks, despite speed
 not being one of Python's prime objectives.

 Computers store data using bytes.
 ASCII characters can be stored using a single byte;
 Unicode code-points cannot be stored in a single byte,
 therefore Unicode will always be inherently slower than ASCII.

 Implementation details mean that some Unicode characters may be handled
 more efficiently than others. Why is this wrong?
 Why should all Unicode operations be equally slow?


 I'd like to dedicate a song to jmf.  From the Canterbury Sound band
 Caravan, the album The Battle Of Hastings, the song title Liar.

 --
 My fellow Pythonistas, ask not what our language can do for you, ask what
 you can do for our language.

 Mark Lawrence

 ---
 This email is free from viruses and malware because avast! Antivirus
 protection is active.
 http://www.avast.com


 --
 https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Steven D'Aprano
On Tue, 10 Jun 2014 19:43:13 +, alister wrote:

 On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:

Please don't feed the troll.

I don't know whether JMF is trolling or if he is a crank who doesn't 
understand what he is doing, but either way he's been trying to square 
this circle for the last couple of years. He believes, or *claims* to 
believe, that a performance regression (one which others cannot 
replicate) is *mathematical proof* that Python's Unicode handling is 
invalid. What can one say to crack-pottery of this magnitude?

Just kill-file his posts and be done.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Ethan Furman

On 06/10/2014 04:29 PM, Devin Jeanpierre wrote:


Please don't be unnecessarily cruel and antagonistic.


I completely agree.  jmf should leave us alone and stop cruelly and 
antagonistically baiting us with stupidity and falsehoods.

--
~Ethan~
--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Mark Lawrence

On 11/06/2014 00:29, Devin Jeanpierre wrote:

Please don't be unnecessarily cruel and antagonistic.

-- Devin


I am simply giving our resident unicode expert a taste of his own 
medicine.  If you don't like that complain to the PSF about the root 
cause of the problem, not the symptoms.


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection 
is active.
http://www.avast.com


--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Anssi Saari
Chris Angelico ros...@gmail.com writes:
 
 I don't have an actual use-case for this, as I don't target
 microcontrollers, but I'm curious: What parts of Py3 syntax aren't
 supported?

I meant to say % formatting for strings but that's apparently been added
recently. My previous micropython build was from February.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Travis Griggs

On Jun 4, 2014, at 4:01 AM, Tim Chase python.l...@tim.thechases.com wrote:

 If you use UTF-8 for everything

It seems to me that, increasingly, other libraries (C, etc.) use utf8 as the 
preferred string interchange format. It’s universal, not prone to endian 
issues, etc. So one *advantage* you gain from using utf8 internally is that any 
time you need to hand a string to an external thing, it’s just ready. An app that 
restricts its internal string processing to streaming-based operations but has to 
hand strings to external libraries a lot (e.g. cairo) might actually benefit from 
using utf8 internally, because a) it’s not doing the linear search for the odd 
character address and b) it no longer needs to decode/encode every time it 
sends or receives a string to an external library.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Roy Smith
In article mailman.10822.1402073958.18130.python-l...@python.org,
 Travis Griggs travisgri...@gmail.com wrote:

 On Jun 4, 2014, at 4:01 AM, Tim Chase python.l...@tim.thechases.com wrote:
 
  If you use UTF-8 for everything
 
 It seems to me, that increasingly other libraries (C, etc), use utf8 as the 
 preferred string interchange format. It's universal, not prone to endian 
 issues, etc.

One of the important "etc" factors is: since it's the most commonly used, 
it's the one that other people are most likely to have implemented 
correctly.  In the real world, these are important considerations.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Tim Chase
On 2014-06-06 09:59, Travis Griggs wrote:
 On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
  If you use UTF-8 for everything
 
 It seems to me, that increasingly other libraries (C, etc), use
 utf8 as the preferred string interchange format.

I definitely advocate UTF-8 for any streaming scenario, as you're
iterating unidirectionally over the data anyways, so why use/transmit
more bytes than needed.  The only failing of UTF-8 that I've found in
the real world(*) is when you have the requirement of constant-time
indexing into strings.
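The constant-time-indexing failure is easy to see: UTF-8 is a variable-width code, so finding the n-th character means walking the leading bytes from the start. A hypothetical helper, purely to illustrate the O(n) scan:

```python
def utf8_index(data: bytes, n: int) -> int:
    """Byte offset of the n-th character in valid UTF-8 data.
    The scan from the start is what makes this O(n), not O(1)."""
    offset = 0
    for _ in range(n):
        first = data[offset]
        if first < 0x80:       # 0xxxxxxx: 1-byte sequence (ASCII)
            offset += 1
        elif first < 0xE0:     # 110xxxxx: 2-byte sequence
            offset += 2
        elif first < 0xF0:     # 1110xxxx: 3-byte sequence
            offset += 3
        else:                  # 11110xxx: 4-byte sequence
            offset += 4
    return offset

b = 'a\xe9\u20ac\U0001F600z'.encode('utf-8')  # 1 + 2 + 3 + 4 + 1 bytes
print(utf8_index(b, 4))                       # 10: walked the first four
print(b[utf8_index(b, 4):].decode('utf-8'))   # z
```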

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-05 Thread Roy Smith
In article f935e85f-f86a-4821-86ab-3ab7e5e21...@googlegroups.com,
 Rustom Mody rustompm...@gmail.com wrote:

 On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
  Yup.  I wrote a while(*) back about the pain I was having importing some 
  data into a MySQL(**) database

 Here's my interpretation of that situation; I'd like to hear yours:
 
 Basic problem was that MySQL handled a strict subset of what the rest
 of the system (Python 2.7?)  could handle.

Yes.  This was not a Python issue.  I was just responding to ChrisA's 
statement:

 Binding your program to BMP-only is nearly as dangerous as binding 
 it to ASCII-only; potentially worse, because you can run an awful 
 lot of artificial tests without remembering to stick in some astral 
 characters.


 Of course switching to postgres may be a sound choice on other fronts.
 But if that were not an option, and you only had these choices:
 
 - significantly complexify your MySQL data structures to handle 4 in
   20 million cases
 - just detect and throw such cases out at the outset
 
 which would you take?

It turns out, we could have upgraded to a newer version of MySQL, which 
did handle astral characters correctly.  But, what we did was discarded 
the records containing non-BMP data.  Of course, that's a decision that 
can only be made when you understand the business requirements.  In our 
case, discarding those four records had no impact on our business, so it 
made sense.  For other people, not having the full dataset might have 
been a fatal problem.

This was just one of many MySQL problems we ran into.  Eventually, we 
decided it wasn't worth fighting with what was obviously a brain-dead 
system, and switched databases.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-05 Thread Chris Angelico
On Thu, Jun 5, 2014 at 11:59 PM, Roy Smith r...@panix.com wrote:
 It turns out, we could have upgraded to a newer version of MySQL, which
 did handle astral characters correctly.  But, what we did was discarded
 the records containing non-BMP data.  Of course, that's a decision that
 can only be made when you understand the business requirements.  In our
 case, discarding those four records had no impact on our business, so it
 made sense.  For other people, not having the full dataset might have
 been a fatal problem.

 This was just one of many MySQL problems we ran into.  Eventually, we
 decided it wasn't worth fighting with what was obviously a brain-dead
 system, and switched databases.

Point to note: It's not just "Avoid MySQL version x.y.z, it's buggy",
but "Make sure you're on a sufficiently new version of MySQL *and then
use these settings*". For instance, the MySQL utf8
locale/collation/charset (not sure what it calls it) supports only the
BMP; you have to use utf8mb4, which is UTF-8 that's allowed to go as
far as four bytes long.

What were they thinking?

What, were they thinking?

I understand there's now an alias utf8mb3 for the buggy utf8, with
some theory that some future version of MySQL might make utf8 become
an alias for utf8mb4. But when would you ever actually *demand* this
buggy behaviour? Why not just say "as of this version, utf8 is
identical to utf8mb4, which was a superset thereof", and if anything
changes or breaks, just acknowledge that it used to be buggy?
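The limit ChrisA describes can be demonstrated from Python's side: a BMP code point never needs more than three UTF-8 bytes, while anything above the BMP needs exactly four, which is what the legacy three-byte utf8 charset cannot store. An illustration (plain Python, not MySQL code):

```python
# MySQL's legacy "utf8" charset stores at most three bytes per
# character, so any code point above the BMP (> U+FFFF) is rejected;
# utf8mb4 allows the full four-byte sequences.
for ch in ('\xe9', '\u20ac', '\u4f60', '\U0001F600'):
    enc = ch.encode('utf-8')
    print('U+%05X -> %d bytes, legacy utf8 ok: %s'
          % (ord(ch), len(enc), len(enc) <= 3))
```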

/rant

Use PostgreSQL.

/obvious

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Ian Kelly
On Jun 3, 2014 11:27 PM, Steven D'Aprano st...@pearwood.info wrote:
 For technical reasons which I don't fully understand, Unicode only
 uses 21 of those 32 bits, giving a total of 1114112 available code
 points.

I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
sufficient to encode up to 16 supplementary planes, so if Unicode were
allowed to grow any larger than that, UTF-16 would no longer be able to
encode all codepoints.
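The arithmetic behind those numbers can be checked directly: 1024 high surrogates times 1024 low surrogates address 2**20 code points above the BMP, i.e. the 16 supplementary planes. A quick sketch:

```python
# High surrogates U+D800-U+DBFF and low surrogates U+DC00-U+DFFF each
# carry 10 bits, so pairs address 2**20 code points above the BMP:
# 16 supplementary planes of 2**16 code points each.
supplementary = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)
total = 0x10000 + supplementary
print(total)  # 1114112 code points, i.e. 17 * 2**16

# Round-trip the highest code point through a surrogate pair:
cp = 0x10FFFF
v = cp - 0x10000
hi, lo = 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
assert 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00) == cp
```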

Another benefit of fixing the size is that it frees the other 11 bits per
character of UTF-32 for packing in ancillary data.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Terry Reedy

On 6/4/2014 1:55 AM, Ian Kelly wrote:


On Jun 3, 2014 11:27 PM, Steven D'Aprano st...@pearwood.info
mailto:st...@pearwood.info wrote:
  For technical reasons which I don't fully understand, Unicode only
  uses 21 of those 32 bits, giving a total of 1114112 available code
  points.

I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
sufficient to encode up to 16 supplementary planes, so if Unicode were
allowed to grow any larger than that, UTF-16 would no longer be able to
encode all codepoints.


I believe the original utf-8 used up to 6 bytes per char to encode 2**31 
potential chars. Just 4 bytes limits to 2**21, and for whatever reason 
(easier decoding?), utf-8 was revised down (unusual ;-).
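For reference, the modern (RFC 3629) byte counts per code point, as a tiny sketch:

```python
def utf8_len(cp: int) -> int:
    """Bytes needed for code point cp in modern (RFC 3629) UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4  # capped at U+10FFFF to match UTF-16's reach; the original
              # 1993 scheme continued with 5- and 6-byte forms

# Cross-check against Python's encoder for a few sample characters:
assert all(utf8_len(ord(c)) == len(c.encode('utf-8'))
           for c in 'a\xe9\u20ac\U0001F600')
```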


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Chris Angelico
On Wed, Jun 4, 2014 at 5:00 PM, Terry Reedy tjre...@udel.edu wrote:
 On 6/4/2014 1:55 AM, Ian Kelly wrote:


 On Jun 3, 2014 11:27 PM, Steven D'Aprano st...@pearwood.info
 mailto:st...@pearwood.info wrote:
   For technical reasons which I don't fully understand, Unicode only
   uses 21 of those 32 bits, giving a total of 1114112 available code
   points.

 I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
 sufficient to encode up to 16 supplementary planes, so if Unicode were
 allowed to grow any larger than that, UTF-16 would no longer be able to
 encode all codepoints.


 I believe the original utf-8 used up to 6 bytes per char to encode 2**31
 potential chars. Just 4 bytes limits to 2**21, and for whatever reason
 (easier decoding?), utf-8 was revised down (unusual ;-).

I understood it to be UTF-16's fault, per Ian's statement. That is to
say, the entire Unicode standard was warped around the problem that
some people were going around thinking a character is 16 bits, even
though that's just as fallacious as a character is 8 bits.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Chris Angelico
On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody rustompm...@gmail.com wrote:
 On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
 On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
  And so a pure BMP-supporting implementation may be a reasonable
  compromise. [As long as no surrogate-pairs are there]

 Not if you're working on the internet. There are several critical
 groups of characters that aren't in the BMP, such as:

 Of course. But what has the internet to do with micropython?

Earlier you said:

 IOW from pov of a universally acceptable character set this is mostly
 rubbish

"Universally acceptable character set" and microcontrollers may well
not meet, but if you're talking about universality, you need Unicode.
It's that simple.

Maybe there's a use-case for a microcontroller that works in
ISO-8859-5 natively, thus using only eight bits per character, but
even if there is, I would expect a Python implementation on it to
expose Unicode codepoints in its strings. (Most of the time you won't
even be aware of the exact codepoint values. It's only when you put
\xNN, \uNNNN, or \UNNNNNNNN escapes into your strings, or explicitly
use ord/chr or equivalent, that it'd make a difference.) The point is
not that you might be able to get away with sticking your head in the
sand and wishing Unicode would just go away. Even if you can, it's not
something Python 3 can ever do.

And I don't think anybody can, anyway. If your device is big enough to
hold Python, it should be big enough to handle Unicode; and then you
don't have to say "Oh, sorry rest-of-the-world, this only works in
English... and only a subset of English..." and stuff.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Chris Angelico
On Wed, Jun 4, 2014 at 3:02 PM, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody rustompm...@gmail.com wrote:
 1) Most or all Chinese and Japanese characters

 Dont know how you count 'most'

 | One possible rationale is the desire to limit the size of the full
 | Unicode character set, where CJK characters as represented by discrete
 | ideograms may approach or exceed 100,000 (while those required for
 | ordinary literacy in any language are probably under 3,000). Version 1
 | of Unicode was designed to fit into 16 bits and only 20,940 characters
 | (32%) out of the possible 65,536 were reserved for these CJK Unified
 | Ideographs. Later Unicode has been extended to 21 bits allowing many
 | more CJK characters (75,960 are assigned, with room for more).

 | From http://en.wikipedia.org/wiki/Han_unification

 So there are 20,940 CJK characters in the BMP, and approximately
 55,000 more in the SIP.  I'd count 55,000 out of 75,960 as most.

And I said or all because I have this vague notion that either NFC
or NFD pushes stuff out of the BMP, although I may be wrong on that.
But certainly 55K/75K with room for more is the most that I was
talking about. (Maybe it isn't most by usage. After all, hypertext
documents are usually smaller in UTF-8 than in UTF-16, despite most
characters (counting purely by 21-bit space in codepoints) being more
compact in UTF-16; most by usage is of ASCII, because hypertext
involves a lot of punctuation and such. But still, there are a lot of
CJK that aren't in the BMP.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Steven D'Aprano
On Wed, 04 Jun 2014 17:16:13 +1000, Chris Angelico wrote:

 On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody rustompm...@gmail.com
 wrote:
 On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
 On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
  And so a pure BMP-supporting implementation may be a reasonable
  compromise. [As long as no surrogate-pairs are there]

 Not if you're working on the internet. There are several critical
 groups of characters that aren't in the BMP, such as:

 Of course. But what has the internet to do with micropython?

When I download a script from the Internet to run on my microcontroller, 
written by somebody in Greece, and it calls print on a Greek string, I 
should see Greek text even if I'm in Sweden or New Zealand or Japan.

A fully localised application would be better, of course, but failing 
that I shouldn't see mojibake.


 Earlier you said:
 
 IOW from pov of a universally acceptable character set this is mostly
 rubbish
 
 Universally acceptable character set and microcontrollers may well not
 meet, but if you're talking about universality, you need Unicode. It's
 that simple.

 
 Maybe there's a use-case for a microcontroller that works in ISO-8859-5
 natively, thus using only eight bits per character, 

That won't even make the Russians happy, since in Russia there are 
multiple incompatible legacy encodings.



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Paul Rubin
Steven D'Aprano st...@pearwood.info writes:
 Maybe there's a use-case for a microcontroller that works in ISO-8859-5
 natively, thus using only eight bits per character, 
 That won't even make the Russians happy, since in Russia there are 
 multiple incompatible legacy encodings.

I've never understood why not use UTF-8 for everything.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Wolfgang Maier

On 04.06.2014 09:16, Chris Angelico wrote:

The point is
not that you might be able to get away with sticking your head in the
sand and wishing Unicode would just go away. Even if you can, it's not
something Python 3 can ever do.



Exactly. These endless discussions about different encodings start to 
get really boring. I cannot think of any aspect of it that hasn't been 
discussed here on several occasions, but as a fact:


Strings are immutable sequences of Unicode code points in Python3 
(https://docs.python.org/3/library/stdtypes.html?highlight=str#textseq) 
and this is not an implementation detail. So if any implementation 
doesn't stick to this convention, it is simply incomplete.



And I don't think anybody can, anyway. If your device is big enough to
hold Python, it should be big enough to handle Unicode; and then you
don't have to say Oh, sorry rest-of-the-world, this only works in
English... and only a subset of English... and stuff.



Wolfgang
--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Robin Becker

On 04/06/2014 08:58, Paul Rubin wrote:

Steven D'Aprano st...@pearwood.info writes:

Maybe there's a use-case for a microcontroller that works in ISO-8859-5
natively, thus using only eight bits per character,

That won't even make the Russians happy, since in Russia there are
multiple incompatible legacy encodings.


I've never understood why not use UTF-8 for everything.


me too

-mojibaked-ly yrs-
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Tim Chase
On 2014-06-04 00:58, Paul Rubin wrote:
 Steven D'Aprano st...@pearwood.info writes:
  Maybe there's a use-case for a microcontroller that works in
  ISO-8859-5 natively, thus using only eight bits per character, 
  That won't even make the Russians happy, since in Russia there
  are multiple incompatible legacy encodings.
 
 I've never understood why not use UTF-8 for everything.

If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation.  Some of us slice
strings for a living. ;-)  I understand that using UTF-32 would allow
us to maintain O(1) indexing at the cost of every string occupying 4
bytes per character.  The FSR (again, as I understand it) allows
strings that fit in one-byte-per-character to use that, scaling up to
use wider characters internally as they're actually needed/used.

At the cost of complexity and non-constant memory space, an O(N)
algorithm could be tweaked down to O(log N) by using an internal
balanced tree of offsets-to-chunks (where the chunk-size was the size
of a block where it was faster to scan linearly than to navigate the
tree).  One might even endow the algorithm with FSR smarts, so each
chunk/fragment could be a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded
piece. /random_ramblings
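Tim's offsets-to-chunks idea can be sketched with a flat checkpoint table instead of a tree (bisect over the table gives the same O(log N) lookup, followed by a bounded linear scan). All names here are made up for illustration:

```python
import bisect

class IndexedUTF8:
    """UTF-8 buffer with O(log N) character indexing: checkpoint
    (char_index, byte_offset) pairs every `chunk` characters, then
    bisect to the nearest checkpoint and scan at most `chunk` chars."""

    def __init__(self, text: str, chunk: int = 64):
        self.data = text.encode('utf-8')
        self.char_marks = []   # checkpoint character indices
        self.byte_marks = []   # matching byte offsets
        offset = 0
        for i, ch in enumerate(text):
            if i % chunk == 0:
                self.char_marks.append(i)
                self.byte_marks.append(offset)
            offset += len(ch.encode('utf-8'))

    @staticmethod
    def _width(first: int) -> int:
        # Sequence length from the leading byte of a UTF-8 character.
        if first < 0x80:
            return 1
        if first < 0xE0:
            return 2
        if first < 0xF0:
            return 3
        return 4

    def __getitem__(self, n: int) -> str:
        k = bisect.bisect_right(self.char_marks, n) - 1  # nearest checkpoint
        i, offset = self.char_marks[k], self.byte_marks[k]
        while i < n:                     # short scan within one chunk
            offset += self._width(self.data[offset])
            i += 1
        w = self._width(self.data[offset])
        return self.data[offset:offset + w].decode('utf-8')

s = 'a\xe9\u20ac\U0001F600' * 100
t = IndexedUTF8(s, chunk=16)
assert all(t[i] == s[i] for i in range(len(s)))
```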

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Robin Becker

On 04/06/2014 12:01, Tim Chase wrote:

On 2014-06-04 00:58, Paul Rubin wrote:

Steven D'Aprano st...@pearwood.info writes:

Maybe there's a use-case for a microcontroller that works in
ISO-8859-5 natively, thus using only eight bits per character,

That won't even make the Russians happy, since in Russia there
are multiple incompatible legacy encodings.


I've never understood why not use UTF-8 for everything.


If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation.  Some of us slice
strings for a living. ;-)  I understand that using UTF-32 would allow
us to maintain O(1) indexing at the cost of every string occupying 4
bytes per character.  The FSR (again, as I understand it) allows
strings that fit in one-byte-per-character to use that, scaling up to
use wider characters internally as they're actually needed/used.



I believe that we should distinguish between glyph/character indexing and string 
indexing. Even in unicode it may be hard to decide where a visual glyph starts 
and ends. I assume most people would like to assign one glyph to one unicode 
code point, but that's not always possible with composed glyphs.


>>> for a in (u'\xc5', u'A\u030a'):
...     for o in (u'\xf6', u'o\u0308'):
...         u = a + u'ngstr' + o + u'm'
...         print(u"%s %s" % (repr(u), u))
...
u'\xc5ngstr\xf6m' Ångström
u'\xc5ngstro\u0308m' Ångström
u'A\u030angstr\xf6m' Ångström
u'A\u030angstro\u0308m' Ångström
>>> u'\xc5ngstr\xf6m' == u'\xc5ngstro\u0308m'
False

so even unicode doesn't always allow for O(1) glyph indexing. I know this is 
artificial, but this is the same situation as utf8 faces; just the frequency 
of occurrence is different. A very large amount of computing is still 
western-centric, so searching a byte string for latin characters is still 
efficient; searching for an n with a tilde on top might not be so easy.
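For completeness, Unicode's partial answer to Robin's example is canonical normalization: NFC composes the combining sequences back into the precomposed forms, after which the comparison succeeds. A small sketch:

```python
import unicodedata

a = '\xc5ngstr\xf6m'          # precomposed Å and ö (8 code points)
b = 'A\u030angstro\u0308m'    # combining ring and diaeresis (10 code points)
assert a != b                 # naive code-point comparison fails

nfc_a = unicodedata.normalize('NFC', a)
nfc_b = unicodedata.normalize('NFC', b)
assert nfc_a == nfc_b         # canonically equivalent after composition
print(len(b), len(nfc_b))     # 10 8
```

Note this fixes equality, not O(1) glyph indexing: some combining sequences have no precomposed form, so even NFC text can need more than one code point per visible glyph.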

--
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Marko Rauhamaa
Tim Chase python.l...@tim.thechases.com:

 On 2014-06-04 00:58, Paul Rubin wrote:
 I've never understood why not use UTF-8 for everything.

 If you use UTF-8 for everything, then you end up in a world where
 string-indexing (see ChrisA's other side thread on this topic) is no
 longer an O(1) operation, but an O(N) operation.

Most string operations are O(N) anyway. Besides, you could try and be
smart and keep a recent index cached so simple for loops would be O(N)
instead of O(N**2). So the idea of keeping strings internally in UTF-8
might not be all that bad.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Marko Rauhamaa
Robin Becker ro...@reportlab.com:

 u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
 False

Now *that* would be a valid reason for our resident Unicode expert to
complain! Py3 in no way solves text representation issues definitively.

 I know this is artificial

Not at all. It probably is out of scope for Python, but it is a real
cause for human suffering. What's Unicode for résumé?

Note, for example, that Google manages to sort out issues like these. It
sees past diacritics and even case endings.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Tim Chase
On 2014-06-04 12:53, Robin Becker wrote:
  If you use UTF-8 for everything, then you end up in a world where
  string-indexing (see ChrisA's other side thread on this topic) is
  no longer an O(1) operation, but an O(N) operation.  Some of us
  slice strings for a living. ;-)
 
 I believe that we should distinguish between glyph/character
 indexing and string indexing. 

I'm only talking about string indexing using my_string[some_slice]
which is traditionally O(1) and breaking that [cw]ould cause
unexpected performance degradation.

-tkc


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Tim Chase
On 2014-06-04 14:57, Marko Rauhamaa wrote:
  If you use UTF-8 for everything, then you end up in a world where
  string-indexing (see ChrisA's other side thread on this topic) is
  no longer an O(1) operation, but an O(N) operation.  
 
 Most string operations are O(N) anyway. Besides, you could try and
 be smart and keep a recent index cached so simple for loops would
 be O(N) instead of O(N**2). So the idea of keeping strings
 internally in UTF-8 might not be all that bad.

As mentioned elsewhere, I've got a LOT of code that expects that
string indexing is O(1), and rarely are those strings/offsets reused.
I'm streaming through customer/provider data files, so caching
wouldn't do much good other than waste space and the time to maintain
them.

If I knew that string indexing was O(something non-constant), I'd
have retooled my algorithms to take that into consideration, but that
would be a lot of code I'd need to touch.

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Robin Becker

On 04/06/2014 13:17, Marko Rauhamaa wrote:
.


Note, for example, that Google manages to sort out issues like these. It
sees past diacritics and even case endings.

.
I guess they must normalize all inputs to some standard form and then search / 
eigenvectorize on those. There are quite a few diacritics and a fair few glyphs 
they could be applied to. I don't think it likely they could map all possible 
combinations to a private range.
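The "normalize all inputs to some standard form" step Robin guesses at is available in Python's standard library via `unicodedata.normalize`. A quick check with the Ångström spellings from upthread (this illustrates NFC/NFD only, and makes no claim about what Google actually does):

```python
import unicodedata

precomposed = '\xc5ngstr\xf6m'        # Å and ö as single code points
decomposed = 'A\u030angstro\u0308m'   # base letters + combining marks

assert precomposed != decomposed      # raw code-point comparison fails
# NFC composes base + combining mark into the precomposed form;
# NFD does the reverse decomposition.
assert unicodedata.normalize('NFC', decomposed) == precomposed
assert unicodedata.normalize('NFD', precomposed) == decomposed
```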

--
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Steven D'Aprano
On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:

 I believe that we should distinguish between glyph/character indexing
 and string indexing. Even in unicode it may be hard to decide where a
 visual glyph starts and ends. I assume most people would like to assign
 one glyph to one unicode, but that's not always possible with composed
 glyphs.
 
 >>> for a in (u'\xc5', u'A\u030a'):
 ...     for o in (u'\xf6', u'o\u0308'):
 ...         u = a + u'ngstr' + o + u'm'
 ...         print("%s %s" % (repr(u), u))
 ...
 u'\xc5ngstr\xf6m' Ångström
 u'\xc5ngstro\u0308m' Ångström
 u'A\u030angstr\xf6m' Ångström
 u'A\u030angstro\u0308m' Ångström
 >>> u'\xc5ngstr\xf6m' == u'\xc5ngstro\u0308m'
 False
 
 so even unicode doesn't always allow for O(1) glyph indexing.

What you're talking about here is graphemes, not glyphs. Glyphs are the 
little pictures that represent the characters when written down. 
Graphemes (technically, grapheme clusters) are the things which native 
speakers of a language believe ought to be considered a single unit. 
Think of them as similar to letters. That can be quite tricky to 
determine, and is dependent on the language you are speaking. The letters 
"ch" are considered two letters in English, but only a single letter in 
Czech and Slovak.

I believe that *grapheme-aware* text processing is *far* too complicated 
for a programming language to promise. If you think that len() needs to 
count graphemes, then what should len("ch") return, 1 or 2? Grapheme 
processing is a complex, complicated task best left up to powerful 
libraries built on top of a sturdy Unicode base.

 I know this is artificial, 

But it isn't artificial in the least. Unicode isn't complicated because 
it's badly designed, or complicated for the sake of complexity. It's 
complicated because human language is complicated. That, and because of 
legacy encodings.


 but this is the same situation as utf8 faces just
 the frequency of occurrence is different. A very large amount of
 computing is still western centric so searching a byte string for latin
 characters is still efficient; searching for an n with a tilde on top
 might not be so easy.

This is a good point, but on balance I disagree. A grapheme-aware library 
is likely to need to be based on more complex data structures than simple 
strings (arrays of code points). But for the underlying relatively simple 
string library, graphemes are too hard. Code points are simple, and the 
language can deal with code points without caring about their semantics. 
For instance, in English, I might not want to insert letters between the 
"q" and "u" of "queen", since in English "u" (nearly) always follows "q". It 
would be inappropriate for the programming language string library to 
care about that, and similarly it would be inappropriate for it to care 
that u'A\u030a' represents a single grapheme Å.
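The gap between len() on code points and what a reader would call "letters" is easy to see. The following is a deliberately crude approximation (it only merges combining marks, and is nowhere near the real grapheme-cluster segmentation of Unicode UAX #29) of the kind of thing a grapheme-aware library would do properly:

```python
import unicodedata

def grapheme_count(s):
    # Count code points that are not combining marks -- a crude
    # approximation of grapheme clusters, not the UAX #29 algorithm.
    return sum(1 for ch in s if not unicodedata.combining(ch))

word = 'A\u030angstro\u0308m'     # Ångström spelled with combining marks
assert len(word) == 10            # code points
assert grapheme_count(word) == 8  # perceived letters
```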



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Paul Rubin
Tim Chase python.l...@tim.thechases.com writes:
 As mentioned elsewhere, I've got a LOT of code that expects that
 string indexing is O(1) and rarely are those strings/offsets reused
 I'm streaming through customer/provider data files, so caching
 wouldn't do much good other than waste space and the time to maintain
 them.

I'm having trouble understanding -- if they're only used once then
what's the problem?  You're reading some enormous file into a string and
then randomly accessing it by character offset?  What size are these
strings?  I can think of a number of workarounds including language
extensions, but mostly I'd be interested in seeing some actual
benchmarks of your unmodified program under both representations.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Roy Smith
In article mailman.10673.1401853976.18130.python-l...@python.org,
 Chris Angelico ros...@gmail.com wrote:

 You can't ignore those. You might be able to say "Well, my program
 will run slower if you throw these at it", but if you're going down
 that route, you probably want the full FSR and the advantages it
 confers on ASCII and Latin-1 strings. Binding your program to BMP-only
 is nearly as dangerous as binding it to ASCII-only; potentially worse,
 because you can run an awful lot of artificial tests without
 remembering to stick in some astral characters.

Yup.  I wrote a while(*) back about the pain I was having importing some 
data into a MySQL(**) database which (unknown to me when I started) only 
handled BMP.  It turns out in the entire dataset of 20-odd million 
records, there were exactly four that had astral characters.  All of my 
tests worked.  I didn't discover the problem until it blew up many hours 
into the final production import run.

(*) Two years?

(**) This was not the only pain point with MySQL.  We eventually 
switched to Postgres.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Rustom Mody
On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
  Chris Angelico  wrote:

  You can't ignore those. You might be able to say "Well, my program
  will run slower if you throw these at it", but if you're going down
  that route, you probably want the full FSR and the advantages it
  confers on ASCII and Latin-1 strings. Binding your program to BMP-only
  is nearly as dangerous as binding it to ASCII-only; potentially worse,
  because you can run an awful lot of artificial tests without
  remembering to stick in some astral characters.

 Yup.  I wrote a while(*) back about the pain I was having importing some 
 data into a MySQL(**) database which (unknown to me when I started) only 
 handled BMP.  It turns out in the entire dataset of 20-odd million 
 records, there were exactly four that had astral characters.  All of my 
 tests worked.  I didn't discover the problem until it blew up many hours 
 into the final production import run.

 (*) Two years?

 (**) This was not the only pain point with MySQL.  We eventually 
 switched to Postgres.

Thanks Roy for bringing up that example - I was trying to recollect
the details.  I forgot about the MySQL angle which adds a different
twist to it.

Here's my interpretation of that situation; I'd like to hear yours:

Basic problem was that MySQL handled a strict subset of what the rest
of the system (Python 2.7?)  could handle.  This meant that at a late
(and embarrassing) stage, exceptions were being thrown, from deep
within the system.

OTOH, let's say you could detect the 'error' (more correctly
'un-handle-able') at the borders of your system, say when the user
enters the data on a web-form. Would you have a problem kicking out
those characters (in both senses!) with a curt:

"Can't deal with all this supra-galactic rubble!"?

Of course switching to postgres may be a sound choice on other fronts.
But if that were not an option, and you only had these choices:

- significantly complexify your MySQL data structures to handle 4 in
  20 million cases
- just detect and throw such cases out at the outset

which would you take?

In any case this is the choice I hear from the micropython folks,
who are explicitly seeking a cut-down version of Python.
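The "detect and throw out at the border" option is only a few lines to implement. A sketch of a validator that rejects astral (non-BMP) input up front (the names `has_astral` and `validate_field` are hypothetical, invented for this example):

```python
def has_astral(s):
    # True if any code point lies outside the Basic Multilingual Plane.
    return any(ord(ch) > 0xFFFF for ch in s)

def validate_field(s):
    # Hypothetical border check for a web form or import pipeline.
    if has_astral(s):
        raise ValueError("Can't deal with all this supra-galactic rubble!")
    return s

assert not has_astral('Ångström')       # BMP-only text passes
assert has_astral('smile \U0001F601')   # astral emoticon is caught
```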

-- 
https://mail.python.org/mailman/listinfo/python-list


Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Damien George
Hi,

We would like to announce Micro Python, an implementation of Python 3
optimised to have a low memory footprint.

While Python has many attractive features, current implementations
(read CPython) are not suited for embedded devices, such as
microcontrollers and small systems-on-a-chip.  This is because CPython
uses an awful lot of RAM -- both stack and heap -- even for simple
things such as integer addition.

Micro Python is a new implementation of the Python 3 language, which
aims to be properly compatible with CPython, while sporting a very
minimal RAM footprint, a compact compiler, and a fast and efficient
runtime.  These goals have been met by employing many tricks with
pointers and bit stuffing, and placing as much as possible in
read-only memory.

Micro Python has the following features:

- Supports almost full Python 3 syntax, including yield (compiles
99.99% of the Python 3 standard library).
- Most scripts use significantly less RAM in Micro Python, and various
benchmark programs run faster, compared with CPython.
- A minimal ARM build fits in 80k of program space, and with all
features enabled it fits in around 200k on Linux.
- Micro Python needs only 2k RAM for a basic REPL.
- It has 2 modes of AOT (ahead of time) compilation to native machine
code, doubling execution speed.
- There is an inline assembler for use in time-critical
microcontroller applications.
- It is written in C99 ANSI C and compiles cleanly under Unix (POSIX),
Mac OS X, Windows and certain ARM based microcontrollers.
- It supports a growing subset of Python 3 types and operations.
- Part of the Python 3 standard library has already been ported to
Micro Python, and work is ongoing to port as much as feasible.

More info at:

http://micropython.org/

You can follow the progress and contribute at github:

www.github.com/micropython/micropython
www.github.com/micropython/micropython-lib

--
Damien / Micro Python team.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico
On Tue, Jun 3, 2014 at 10:27 PM, Damien George
damien.p.geo...@gmail.com wrote:
 - Supports almost full Python 3 syntax, including yield (compiles
 99.99% of the Python 3 standard library).
 - It supports a growing subset of Python 3 types and operations.
 - Part of the Python 3 standard library has already been ported to
 Micro Python, and work is ongoing to port as much as feasible.

I don't have an actual use-case for this, as I don't target
microcontrollers, but I'm curious: What parts of Py3 syntax aren't
supported? And since you say port as much as feasible, presumably
there'll be parts that are never supported. Are there some syntactic
elements that just take up way too much memory?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Steven D'Aprano
On Tue, 03 Jun 2014 13:27:11 +0100, Damien George wrote:

 Hi,
 
 We would like to announce Micro Python, an implementation of Python 3
 optimised to have a low memory footprint.

Fantastic!




-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Paul Sokolovsky
Hello,

On Tue, 3 Jun 2014 23:11:46 +1000
Chris Angelico ros...@gmail.com wrote:

 On Tue, Jun 3, 2014 at 10:27 PM, Damien George
 damien.p.geo...@gmail.com wrote:
  - Supports almost full Python 3 syntax, including yield (compiles
  99.99% of the Python 3 standard library).
  - It supports a growing subset of Python 3 types and operations.
  - Part of the Python 3 standard library has already been ported to
  Micro Python, and work is ongoing to port as much as feasible.
 
 I don't have an actual use-case for this, as I don't target
 microcontrollers, 

Please let me chime in, as one of the MicroPython contributors. I also
don't have an immediate use case for a Python microcontroller (but seeing
how fast the industry moves, I won't be surprised if in half a year it
will seem just right). Instead, I treat MicroPython as a Python
implementation which scales *down* very well. With the current situation
in the industry, people mostly care about scaling up -- consuming more
gigabytes and gigahertz, catching more clouds and including heavier and
heavier batteries.

MicroPython goes another direction. You don't have to use it on a
microcontroller. It's just if you want/need it, you'll be able - while
still staying with your favorite language.

I'm personally interested in using MicroPython on small embedded
Linux systems, like home routers, Internet-of-Things devices, etc. Such
devices usually have just a few hundred megahertz of CPU power and
2-4MB of flash. And to cut costs, the lower bound decreases all the
time.

 but I'm curious: What parts of Py3 syntax aren't
 supported? And since you say port as much as feasible, presumably
 there'll be parts that are never supported. Are there some syntactic
 elements that just take up way too much memory?

Syntax-wise, all Python 3.3 syntax is supported. This includes things
like yield from, annotations, etc. For example:

$ micropython
Micro Python v1.0.1-139-g411732e on 2014-06-03; UNIX version
>>> def foo(a: int) -> float:
...     return float(a)
...
>>> foo(4)
4.0


The 99.9% statement is due to the fact that there were some problems
parsing a couple of files in the CPython 3.3/3.4 stdlib.

Note that above talks about syntax, not semantics. Though core
language semantics is actually now implemented pretty well. For
example, yield from works pretty well, so asyncio could work ;-).
(Except my analysis showed that CPython's implementation is a bit
bloated for MicroPython requirements, so I started to write a
simplified implementation from scratch).


As can be seen from the dump above, MicroPython works perfectly on a
Linux system, so we encourage any pythonista to touch a little bit of
Python magic and give it a try! ;-) And we are of course interested in
feedback on how portable it is, etc.

(As a side note, it's of course possible to compile and run MicroPython
on Windows too, it's a bit more complicated than just make.)

 
 ChrisA
 -- 
 https://mail.python.org/mailman/listinfo/python-list



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 2:49 AM, Paul Sokolovsky pmis...@gmail.com wrote:
 As can be seen from the dump above, MicroPython works perfectly on a
 Linux system, so we encourage any pythonista to touch a little bit of
 Python magic and give it a try! ;-) And we are of course interested in
 feedback on how portable it is, etc.


With that encouragement, I just cloned your repo and built it on amd64
Debian Wheezy. Works just fine! Except... I've just found one fairly
major problem with your support of Python 3.x syntax. Your str type is
documented as not supporting Unicode. Is that a current flaw that
you're planning to remove, or a design limitation? Either way, I'm a
bit dubious about a purported version 1 that doesn't do one of the
things that Py3 is especially good at - matched by very few languages
in its encouragement of best practice with Unicode support.

What is your str type actually able to support? It seems to store
non-ASCII bytes in it, which I presume are supposed to represent the
rest of Latin-1, but I wasn't able to print them out:

Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
>>> print("asdf\xfdqwer")

Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
[GCC 4.7.2] on linux
>>> print("asdf\xfdqwer")
asdfýqwer

In fact, printing seems to work with bytes:

>>> print("asdf\xc3\xbdqwer")
asdfýqwer

(my terminal uses UTF-8, this is the UTF-8 encoding of the above string)

I would strongly recommend either implementing all of PEP 393, or at
least making it very clear that this pretends everything is bytes -
and possibly disallowing any codepoint > 127 in any string, which will
at least mean you're safe on all ASCII-compatible encodings.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Paul Sokolovsky
Hello,

On Wed, 4 Jun 2014 03:08:57 +1000
Chris Angelico ros...@gmail.com wrote:

[]

 With that encouragement, I just cloned your repo and built it on amd64
 Debian Wheezy. Works just fine! Except... I've just found one fairly
 major problem with your support of Python 3.x syntax. Your str type is
 documented as not supporting Unicode. Is that a current flaw that
 you're planning to remove, or a design limitation? Either way, I'm a
 bit dubious about a purported version 1 that doesn't do one of the
 things that Py3 is especially good at - matched by very few languages
 in its encouragement of best practice with Unicode support.

I should start by saying that it's MicroPython that made me look at
Python3. So for me, it already did a lot of good by getting me out from
under the rock: now, instead of "at my job, we use python 2.x" I may
report "at my job, we don't wait until our distro kicks us in the ass,
and add 'from __future__ import print_function' whenever we touch some
code".

With that in mind, I, as many others, think that forcing Unicode bloat
upon people by default is the most controversial feature of Python3.
The reason is that you go a very long way dealing with the languages of
the people of the world by just treating strings as consisting of 8-bit
data. I'd say that's enough for 90% of applications. Unicode is needed
only if one needs to deal with multiple languages *at the same time*,
which is fairly rare (remaining 10% of apps).

And please keep in mind that MicroPython was originally intended for (and
should remain scalable down to) an MCU. Unicode is needed there even
less, and there are even fewer resources to support Unicode "just because".

 
 What is your str type actually able to support? It seems to store
 non-ASCII bytes in it, which I presume are supposed to represent the
 rest of Latin-1, but I wasn't able to print them out:

There's a work-in-progress on documenting differences between CPython
and MicroPython at
https://github.com/micropython/micropython/wiki/Differences, which gives
the following account on this:

"No unicode support is actually implemented. Python3 calls for strict
difference between str and bytes data types (unlike Python2, which has
neutral unified data type for strings and binary data, and separates
out unicode data type). MicroPython faithfully implements str/bytes
separation, but currently, underlying str implementation is the same as
bytes. This means strings in MicroPython are not unicode, but 8-bit
characters (fully binary-clean)."

 
 Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
 >>> print("asdf\xfdqwer")
 
 Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
 [GCC 4.7.2] on linux
 >>> print("asdf\xfdqwer")
 asdfýqwer
 
 In fact, printing seems to work with bytes:
 
 >>> print("asdf\xc3\xbdqwer")
 asdfýqwer
 
 (my terminal uses UTF-8, this is the UTF-8 encoding of the above
 string)
 
 I would strongly recommend either implementing all of PEP 393, or at
 least making it very clear that this pretends everything is bytes -
 and possibly disallowing any codepoint > 127 in any string, which will
 at least mean you're safe on all ASCII-compatible encodings.

MicroPython is not the first tiny Python implementation. What sets
MicroPython apart is that being a subset of the language is neither
its aim nor its motto. And yet, it's not a CPython rewrite either. So,
while Unicode
support is surely possible, it's unlikely to be done as all of
PEPxxx. If you ask me, I'd personally envision it to be implemented as
UTF-8 (in this regard I agree with (or take an influence from) 
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/). But I don't have plans
to work on Unicode any time soon - applications I envision for
MicroPython so far fit in those 90% that live happily without Unicode.

But generally, there's no strict roadmap for MicroPython features.
While core of the language (parser, compiler, VM) is developed by
Damien, many other features were already contributed by the community
(project went open-source at the beginning of the year). So, if someone
will want to see Unicode support up to the level of providing patches,
it gladly will be accepted. The only thing we established is that we
want to be able to scale down, and thus almost all features should be
configurable.


 
 ChrisA
 -- 
 https://mail.python.org/mailman/listinfo/python-list



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 7:41 AM, Paul Sokolovsky pmis...@gmail.com wrote:
 Hello,

 On Wed, 4 Jun 2014 03:08:57 +1000
 Chris Angelico ros...@gmail.com wrote:

 []

 With that encouragement, I just cloned your repo and built it on amd64
 Debian Wheezy. Works just fine! Except... I've just found one fairly
 major problem with your support of Python 3.x syntax. Your str type is
 documented as not supporting Unicode. Is that a current flaw that
 you're planning to remove, or a design limitation? Either way, I'm a
 bit dubious about a purported version 1 that doesn't do one of the
 things that Py3 is especially good at - matched by very few languages
 in its encouragement of best practice with Unicode support.

 I should start by saying that it's MicroPython that made me look at
 Python3. So for me, it already did a lot of good by getting me out from
 under the rock: now, instead of "at my job, we use python 2.x" I may
 report "at my job, we don't wait until our distro kicks us in the ass,
 and add 'from __future__ import print_function' whenever we touch some
 code".

And that's a good thing :) Using Python 2.7 and starting to put in the
future directives breaks nothing, and will save you time later.

 With that in mind, I, as many others, think that forcing Unicode bloat
 upon people by default is the most controversial feature of Python3.
 The reason is that you go a very long way dealing with the languages of
 the people of the world by just treating strings as consisting of 8-bit
 data. I'd say that's enough for 90% of applications. Unicode is needed
 only if one needs to deal with multiple languages *at the same time*,
 which is fairly rare (remaining 10% of apps).

Absolutely not. This is the mentality that results in web applications
that break on funny characters, which is completely the wrong way to
look at it. The truth is, there are not many funny characters in
Unicode at all; I found these, but that's about it:

http://www.fileformat.info/info/unicode/char/1F601/index.htm
http://www.fileformat.info/info/unicode/char/1F638/index.htm

Your code should accept any valid character with equal correctness.
(Note to jmf: Correctness does not necessarily imply exact nanosecond
performance, just that the right result is reached.) These days,
Unicode *is* needed everywhere. You might think you can get away with
8-bit data, but is that 8-bit data actually encoded Latin-1 or
UTF-8? There's a vast difference between them, and you'll hit it in
any English text with U+00A9 ©, or U+201C U+201D quotes, or any of a
large number of other common non-ASCII characters. Oh, and the three I
just mentioned happen to be in CP-1252, another common 8-bit encoding,
and a lot of people and programs don't know how to tell CP-1252 from
Latin-1 and label one as the other.

Unicode is needed on anything that touches the internet, which is a
*lot* more than 10% of applications. Unicode is also needed on
anything that shares files with anyone who speaks more than one
language, or uses any symbol that isn't in ASCII, or pretty much
anything beyond plain English with a restricted set of punctuation.
And even if you can guarantee that you're working only with English
and only with ASCII, you still need to be aware that ASCII text is
different stuff from a JPEG file, although it's possible to bury
your head in the sand over that one.
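The Latin-1-versus-UTF-8 ambiguity Chris describes is easy to demonstrate: the same run of bytes decodes to different text depending on which 8-bit assumption you make.

```python
data = b'asdf\xc3\xbdqwer'

# Interpreted as UTF-8: \xc3\xbd is one character, U+00FD (ý).
assert data.decode('utf-8') == 'asdf\xfdqwer'

# Interpreted as Latin-1: the same two bytes are two characters,
# U+00C3 (Ã) and U+00BD (½) -- classic mojibake.
assert data.decode('latin-1') == 'asdf\xc3\xbdqwer'
```

Nothing in the bytes themselves says which reading is right; that knowledge has to travel separately, which is exactly why "just treat strings as 8-bit data" breaks down.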

 But generally, there's no strict roadmap for MicroPython features.
 While core of the language (parser, compiler, VM) is developed by
 Damien, many other features were already contributed by the community
 (project went open-source at the beginning of the year). So, if someone
 will want to see Unicode support up to the level of providing patches,
 it gladly will be accepted. The only thing we established is that we
 want to be able to scale down, and thus almost all features should be
 configurable.

And that's exactly what's happening right now.

https://github.com/micropython/micropython/issues/657
https://github.com/Rosuav/micropython

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Rustom Mody
On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:

 With that in mind, I, as many others, think that forcing Unicode bloat
 upon people by default is the most controversial feature of Python3.
 The reason is that you go a very long way dealing with the languages of
 the people of the world by just treating strings as consisting of 8-bit
 data. I'd say that's enough for 90% of applications. Unicode is needed
 only if one needs to deal with multiple languages *at the same time*,
 which is fairly rare (remaining 10% of apps).

 And please keep in mind that MicroPython was originally intended for (and
 should remain scalable down to) an MCU. Unicode is needed there even
 less, and there are even fewer resources to support Unicode "just because".

At some time (when jmf was making more intelligible noises) I had
suggested that the choice between 1/2/4 byte strings that happens at
runtime in python3's FSR can be made at python-start time with a
command-line switch.  There are many combinations here; here is one in
more detail:

Instead of having one (FSR) string engine, you have (up to) 4:

- a pure 1 byte (ASCII)
- a pure 2 byte (BMP) with decode-failures for out-of-ranges
- a pure 4 byte -- everything UTF-32
- FSR dynamic switching at runtime (with massive moping from the world's jmfs)

The point is that only one of these engines would be brought into memory
based on command-line/config options.
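For reference, the runtime choice that the FSR engine makes can be observed directly in CPython (3.3+): each string is stored in the narrowest of 1-, 2-, or 4-byte units that can hold its widest code point, and the footprint differences show up in sys.getsizeof.

```python
import sys

# CPython's FSR (PEP 393) picks the storage width per string at
# creation time, based on the widest code point present.
ascii_s = 'a' * 100            # 1 byte per character
bmp_s = '\u0416' * 100         # Cyrillic Zhe: 2 bytes per character
astral_s = '\U0001F601' * 100  # astral emoticon: 4 bytes per character

assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```

Rustom's proposal amounts to freezing one of these widths at interpreter start-up instead of deciding per string.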

Some more personal thoughts (that may be quite ill-informed!):

1. I regard myself as a unicode ignoramus+enthusiast. The world will
be a better place if unicode is more pervasive.
See http://blog.languager.org/2014/04/unicoded-python.html

As it happens I am also a computer scientist -- I understand that in
contexts where anything other than 8-bit chars is unacceptably
inefficient, unicode-bloat may be a real thing.

2. My casual/cursory reading of the contents of the SMP-planes
suggests that the stuff there is things like
- Egyptian hieroglyphics
- mahjong characters
- ancient Greek musical symbols
- alchemical symbols etc etc.

IOW from pov of a universally acceptable character set this is mostly
rubbish

And so a pure BMP-supporting implementation may be a reasonable
compromise. [As long as no surrogate-pairs are there]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico
On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody rustompm...@gmail.com wrote:
 2. My casual/cursory reading of the contents of the SMP-planes
 suggests that the stuff there is things like
 - Egyptian hieroglyphics
 - mahjong characters
 - ancient Greek musical symbols
 - alchemical symbols etc etc.

 IOW from pov of a universally acceptable character set this is mostly
 rubbish

 And so a pure BMP-supporting implementation may be a reasonable
 compromise. [As long as no surrogate-pairs are there]

Not if you're working on the internet. There are several critical
groups of characters that aren't in the BMP, such as:

1) Most or all Chinese and Japanese characters
2) Heaps of emoticons and fancy letters
3) Mathematical symbols

You can't ignore those. You might be able to say "Well, my program
will run slower if you throw these at it", but if you're going down
that route, you probably want the full FSR and the advantages it
confers on ASCII and Latin-1 strings. Binding your program to BMP-only
is nearly as dangerous as binding it to ASCII-only; potentially worse,
because you can run an awful lot of artificial tests without
remembering to stick in some astral characters.

It's not rubbish. It's important stuff that you need to deal with.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Rustom Mody
On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
 On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
  And so a pure BMP-supporting implementation may be a reasonable
  compromise. [As long as no surrogate-pairs are there]

 Not if you're working on the internet. There are several critical
 groups of characters that aren't in the BMP, such as:

Of course. But what has the internet to do with micropython?

This is their stated goal:
| Micro Python is a lean and fast implementation of the Python
| programming language (python.org) that is optimised to run on a
| microcontroller.


 1) Most or all Chinese and Japanese characters

Don't know how you count 'most'.

| One possible rationale is the desire to limit the size of the full
| Unicode character set, where CJK characters as represented by discrete
| ideograms may approach or exceed 100,000 (while those required for
| ordinary literacy in any language are probably under 3,000). Version 1
| of Unicode was designed to fit into 16 bits and only 20,940 characters
| (32%) out of the possible 65,536 were reserved for these CJK Unified
| Ideographs. Later Unicode has been extended to 21 bits allowing many
| more CJK characters (75,960 are assigned, with room for more).

| From http://en.wikipedia.org/wiki/Han_unification
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Ian Kelly
On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody rustompm...@gmail.com wrote:
 1) Most or all Chinese and Japanese characters

 Don't know how you count 'most'.

 | One possible rationale is the desire to limit the size of the full
 | Unicode character set, where CJK characters as represented by discrete
 | ideograms may approach or exceed 100,000 (while those required for
 | ordinary literacy in any language are probably under 3,000). Version 1
 | of Unicode was designed to fit into 16 bits and only 20,940 characters
 | (32%) out of the possible 65,536 were reserved for these CJK Unified
 | Ideographs. Later Unicode has been extended to 21 bits allowing many
 | more CJK characters (75,960 are assigned, with room for more).

 | From http://en.wikipedia.org/wiki/Han_unification

So there are 20,940 CJK characters in the BMP, and approximately
55,000 more in the SIP.  I'd count 55,000 out of 75,960 as most.


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Steven D'Aprano
On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:

 On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
 
 With that in mind, I, as many others, think that forcing Unicode bloat
 upon people by default is the most controversial feature of Python3.
 The reason is that you go a very long way dealing with languages of the
 people of the world by just treating strings as consisting of 8-bit
 data. I'd say, that's enough for 90% of applications. Unicode is needed
 only if one needs to deal with multiple languages *at the same time*,
 which is fairly rare (remaining 10% of apps).
 
 And please keep in mind that MicroPython was originally intended (and
should remain scalable down to) an MCU. Unicode needed there is even
 less, and even less resources to support Unicode just because.
 
 At some time (when jmf was making more intelligible noises) I had
 suggested that the choice between 1/2/4 byte strings that happens at
 runtime in python3's FSR can be made at python-start time with a
 command-line switch.  There are many combinations here; here is one in
 more detail:
 
 Instead of having one (FSR) string engine, you have (upto) 4
 
 - a pure 1 byte (ASCII)

There are only 128 ASCII characters, so a pure ASCII implementation 
cannot even represent arbitrary bytes.
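
A quick illustration of this point, runnable in CPython: Latin-1 maps every byte value 0-255 to a code point, while ASCII covers only 0-127, so only the former can round-trip arbitrary byte data through a str.

```python
# Latin-1 can round-trip arbitrary bytes; ASCII cannot.
data = bytes(range(256))

# Every byte value 0-255 decodes to exactly one Latin-1 code point and back.
assert data.decode("latin-1").encode("latin-1") == data

# ASCII decoding fails as soon as it meets a byte >= 0x80.
try:
    data.decode("ascii")
    fail_at = None
except UnicodeDecodeError as exc:
    fail_at = exc.start
print("ASCII fails at byte", fail_at)   # ASCII fails at byte 128
```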


 - a pure 2 byte (BMP) with decode-failures for out-of-ranges

That's not Unicode. It's a subset of Unicode.


 - a pure 4 byte -- everything UTF-32

For embedded devices, that would be extremely memory hungry. Remember, 
every variable, every attribute name, every method and class and function 
name is a string. Using at least 56 bytes just to refer to 
sys.stdout.write will be painful.
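
The 56-byte figure is simply the three identifier strings involved, at four bytes per character (character data only, ignoring per-object overhead):

```python
# Where the 56-byte figure comes from: under a pure UTF-32 representation
# every character costs 4 bytes, even in plain ASCII identifiers.
names = ["sys", "stdout", "write"]
total_chars = sum(len(n) for n in names)    # 3 + 6 + 5 = 14 characters
print(total_chars * 4)                      # 56 bytes of character data

# The same figure via an actual UTF-32 encoding (little-endian, no BOM):
print(sum(len(n.encode("utf-32-le")) for n in names))   # 56
```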


 - FSR dynamic switching at runtime (with massive moping from the world's
 jmfs)

Please stop giving JMF's crackpot opinion even the dignity of being 
sneered at.

[...]
 2. My casual/cursory reading of the contents of the SMP-planes suggests
 that the stuff there are things like - Egyptian hieroglyphics
 - mahjong characters
 - ancient greek musical symbols
 - alchemical symbols etc etc.
 
 IOW from pov of a universally acceptable character set this is mostly
 rubbish

Certainly some of these things are more whimsical than practical, but it 
doesn't really matter. Even if you strip out every bit of whimsy from the 
Unicode character set, you're still left with needing more than 65536 
characters (16 bits). For efficiency you aren't going to use 17 bits, or 
18, or 19, so it's actually faster and more efficient to jump right to 32 
bits. For technical reasons which I don't fully understand, Unicode only 
uses 21 of those 32 bits, giving a total of 1114112 available code 
points. Whether you or I personally have need for alchemical symbols, 
*some people* do, and supporting their use-case doesn't harm us by one 
bit.
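
The 21-bit ceiling is not arbitrary, by the way: it falls out of UTF-16's surrogate mechanism. A surrogate pair combines one of 1024 high surrogates with one of 1024 low surrogates, which is exactly enough to address the code points beyond the BMP:

```python
# Why Unicode stops at U+10FFFF: UTF-16 surrogate pairs combine one of
# 1024 high surrogates with one of 1024 low surrogates, addressing
# exactly 1024 * 1024 code points beyond the 65,536 in the BMP.
high = 0xDC00 - 0xD800           # 1024 high surrogates (U+D800..U+DBFF)
low = 0xE000 - 0xDC00            # 1024 low surrogates  (U+DC00..U+DFFF)
total = 0x10000 + high * low     # BMP + supplementary planes
print(total)                     # 1114112, i.e. 0x110000 code points
```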


 And so a pure BMP-supporting implementation may be a reasonable
 compromise. [As long as no surrogate-pairs are there]

At the cost of one extra bit, strings could use UTF-16 internally and 
still have correct behaviour. The bit could be a flag recording whether 
the string contains any surrogate pairs. If the flag was 0, all string 
operations could assume a constant 2-bytes-per-character. If the flag was 
1, it could fall back to walking the string checking for surrogate pairs.
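
This one-extra-bit scheme can be sketched in a few lines. The following is a toy illustration only, not MicroPython's actual string type; `U16String` and its methods are invented names:

```python
# Toy sketch of "UTF-16 plus a surrogate flag": record at construction
# time whether the string contains any surrogate pairs, and only pay
# for walking the string when it does.
class U16String:
    def __init__(self, text):
        self.units = text.encode("utf-16-le")
        self.has_surrogates = any(
            0xD800 <= int.from_bytes(self.units[i:i + 2], "little") <= 0xDFFF
            for i in range(0, len(self.units), 2)
        )

    def __getitem__(self, index):
        if not self.has_surrogates:
            # Fast path: constant 2 bytes per character
            # (toy code: no bounds checking here).
            return self.units[2 * index:2 * index + 2].decode("utf-16-le")
        # Slow path: walk the code units, stepping over surrogate pairs.
        i = chars = 0
        while i < len(self.units):
            unit = int.from_bytes(self.units[i:i + 2], "little")
            width = 4 if 0xD800 <= unit <= 0xDBFF else 2
            if chars == index:
                return self.units[i:i + width].decode("utf-16-le")
            chars += 1
            i += width
        raise IndexError(index)

s = U16String("abc")            # pure BMP: fast path
t = U16String("a\U0001F600b")   # the emoji is outside the BMP
print(s[1], s.has_surrogates)   # b False
print(t[1] == "\U0001F600", t.has_surrogates)   # True True
```

Indexing stays O(1) for the (overwhelmingly common) surrogate-free case and degrades to O(n) only when astral characters are actually present.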


-- 
Steven


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Rustom Mody
On Wednesday, June 4, 2014 10:50:21 AM UTC+5:30, Steven D'Aprano wrote:
 On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:
  And so a pure BMP-supporting implementation may be a reasonable
  compromise. [As long as no surrogate-pairs are there]

 At the cost of one extra bit, strings could use UTF-16 internally and 
 still have correct behaviour. The bit could be a flag recording whether 
 the string contains any surrogate pairs. If the flag was 0, all string 
 operations could assume a constant 2-bytes-per-character. If the flag was 
 1, it could fall back to walking the string checking for surrogate pairs.

Yes.  That could be one possibility.  My main reason for giving the
4-engine choice was not that 4 engines are a good idea, but that in the
very differently constrained world of μ-controllers, playing around with
alternate binding times may be advantageous.


  On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
  With that in mind, I, as many others, think that forcing Unicode bloat
  upon people by default is the most controversial feature of Python3.
  The reason is that you go a very long way dealing with languages of the
  people of the world by just treating strings as consisting of 8-bit
  data. I'd say, that's enough for 90% of applications. Unicode is needed
  only if one needs to deal with multiple languages *at the same time*,
  which is fairly rare (remaining 10% of apps).
  And please keep in mind that MicroPython was originally intended (and
  should remain scalable down to) an MCU. Unicode needed there is even
  less, and even less resources to support Unicode just because.
  At some time (when jmf was making more intelligible noises) I had
  suggested that the choice between 1/2/4 byte strings that happens at
  runtime in python3's FSR can be made at python-start time with a
  command-line switch.  There are many combinations here; here is one in
  more detail:
  Instead of having one (FSR) string engine, you have (upto) 4
  - a pure 1 byte (ASCII)

 There are only 128 ASCII characters, so a pure ASCII implementation 
 cannot even represent arbitrary bytes.

Yes, this is a subtle point.
I was initially going to write Latin-1, but wrote a rough-and-ready ASCII.
But maybe it could be a choice.

I really don't understand the binding-times of μ-controllers.

My impression is that actual development is split between:
1. tinkering with the board
2. working on full-powered computers and downloading to the board

In going from 2 to 1, heavy cut-downs are probably possible and
desirable. If this is the case, having hooks in the system for making
optimal choices may be worthwhile.