Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-11 Thread Michael Torrie

On 06/10/2014 01:43 PM, alister wrote:
> On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:
>> BTW, very easy to explain.

Yeah he keeps saying that, but he never does explain--just flails around
and mumbles "unicode.org."  Guess everyone has to have his or her
windmill to tilt at.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-11 Thread Ben Finney

alister  writes:

> On Wed, 11 Jun 2014 08:29:06 +1000, Tim Delaney wrote:
> > By his own admission, jmf doesn't use Python anymore. His only
> > reason to remain on this emailing/newsgroup is to troll about the
> > FSR. Please don't reply to him (and preferably add him to your
> > killfile).
>
> I couldn't kill-file JMF; I find his posts useful.

That's fine, kill-filing his posts is a matter that affects only you.

But please do not reply to them, nor taunt him in unrelated posts; it
disrupts this forum.
Instead, give him no reason to think anyone is interested.

-- 
 \ “Too many pieces of music finish too long after the end.” —Igor |
  `\   Stravinskey |
_o__)  |
Ben Finney


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-11 Thread alister

On Wed, 11 Jun 2014 08:29:06 +1000, Tim Delaney wrote:

> On 11 June 2014 05:43, alister  wrote:
> 
> 
>> Your error reports always seem to revolve around benchmarks despite
>> speed not being one of Python's prime objectives
>>
>>
> By his own admission, jmf doesn't use Python anymore. His only reason to
> remain on this emailing/newsgroup is to troll about the FSR. Please
> don't reply to him (and preferably add him to your killfile).
> 

I couldn't kill-file JMF; I find his posts useful.
Every time I find myself agreeing with him, I know I have got it wrong.



-- 
The nice thing about Windows is - It does not just crash, it displays a
dialog box and lets you press 'OK' first.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Mark Lawrence


On 11/06/2014 00:29, Devin Jeanpierre wrote:
> Please don't be unnecessarily cruel and antagonistic.
>
> -- Devin

I am simply giving our resident unicode expert a taste of his own
medicine.  If you don't like that, complain to the PSF about the root
cause of the problem, not the symptoms.


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Ethan Furman


On 06/10/2014 04:29 PM, Devin Jeanpierre wrote:
> Please don't be unnecessarily cruel and antagonistic.


I completely agree.  jmf should leave us alone and stop cruelly and 
antagonizingly baiting us with stupidity and falsehoods.

--
~Ethan~

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Steven D'Aprano

On Tue, 10 Jun 2014 19:43:13 +, alister wrote:

> On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:

Please don't feed the troll.

I don't know whether JMF is trolling or if he is a crank who doesn't 
understand what he is doing, but either way he's been trying to square 
this circle for the last couple of years. He believes, or *claims* to 
believe, that a performance regression (one which others cannot 
replicate) is *mathematical proof* that Python's Unicode handling is 
invalid. What can one say to crack-pottery of this magnitude?

Just kill-file his posts and be done.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Devin Jeanpierre

Please don't be unnecessarily cruel and antagonistic.

-- Devin

On Tue, Jun 10, 2014 at 4:16 PM, Mark Lawrence  wrote:
> On 10/06/2014 20:43, alister wrote:
>>
>> On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:
>>
>
> [snip the garbage]
>
>
>>>
>>> jmf
>>
>>
>> Your error reports always seem to revolve around benchmarks despite speed
>> not being one of Python's prime objectives
>>
>> Computers store data using bytes
>> ASCII characters can be stored using a single byte
>> Unicode code points cannot be stored in a single byte
>> therefore Unicode will always be inherently slower than ASCII
>>
>> implementation details mean that some Unicode characters may be handled
>> more efficiently than others, why is this wrong?
>> why should all Unicode operations be equally slow?
>>
>
> I'd like to dedicate a song to jmf.  From the "Canterbury Sound" band
> Caravan, the album "The Battle Of Hastings", the song title "Liar".
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask what
> you can do for our language.
>
> Mark Lawrence
>

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Mark Lawrence


On 10/06/2014 20:43, alister wrote:
> On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:

[snip the garbage]

>> jmf
>
> Your error reports always seem to revolve around benchmarks despite speed
> not being one of Python's prime objectives
>
> Computers store data using bytes
> ASCII characters can be stored using a single byte
> Unicode code points cannot be stored in a single byte
> therefore Unicode will always be inherently slower than ASCII
>
> implementation details mean that some Unicode characters may be handled
> more efficiently than others, why is this wrong?
> why should all Unicode operations be equally slow?



I'd like to dedicate a song to jmf.  From the "Canterbury Sound" band 
Caravan, the album "The Battle Of Hastings", the song title "Liar".


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread Tim Delaney

On 11 June 2014 05:43, alister  wrote:

>
> Your error reports always seem to revolve around benchmarks despite speed
> not being one of Python's prime objectives
>

By his own admission, jmf doesn't use Python anymore. His only reason to
remain on this emailing/newsgroup is to troll about the FSR. Please don't
reply to him (and preferably add him to your killfile).

Tim Delaney

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-10 Thread alister

On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:

> On Saturday, 7 June 2014 04:20:22 UTC+2, Tim Chase wrote:
>> On 2014-06-06 09:59, Travis Griggs wrote:
>> > On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
>> > > If you use UTF-8 for everything
>> >
>> > It seems to me, that increasingly other libraries (C, etc), use
>> > utf8 as the preferred string interchange format.
>>
>> I definitely advocate UTF-8 for any streaming scenario, as you're
>> iterating unidirectionally over the data anyway, so why use/transmit
>> more bytes than needed?  The only failing of UTF-8 that I've found in
>> the real world(*) is when you have the requirement of constant-time
>> indexing into strings.
>>
>> -tkc
> 
> And once again, just an illustration,
> 
> >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
> [0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
> >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
> [2.5541921791045183, 2.52434366066052, 2.5337417948967413]
> >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
> [0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
> >>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
> [0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
> >>>
> >>> sys.getsizeof('abc'*1000 + '\u0fce')
> 6040
> >>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
> 3020
> 
> But you know, that's not the problem.
> 
> When I see a core developer discussing benchmarking,
> when the same application using non-ASCII chars becomes 1, 2, 5, 10, 20
> times, if not more, slower compared to pure ASCII, I'm wondering if there
> is not a serious problem somewhere.
> 
> (and also becoming slower than Py3.2)
> 
> BTW, very easy to explain.
> 
> I do not understand why the "free, open, what-you-wish-here, ... "
> software is so often pushing to the adoption of serious corporate
> products.
> 
> jmf

Your error reports always seem to revolve around benchmarks despite speed
not being one of Python's prime objectives

Computers store data using bytes
ASCII characters can be stored using a single byte
Unicode code points cannot be stored in a single byte
therefore Unicode will always be inherently slower than ASCII

implementation details mean that some Unicode characters may be handled
more efficiently than others, why is this wrong?
why should all Unicode operations be equally slow?
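
For anyone who wants to try the comparison themselves, here is a minimal sketch of the same kind of measurement. The case names, the `run` helper and the reduced iteration count are choices made for this illustration and are not taken from the quoted session; timings will differ between machines and Python versions.

```python
import timeit

STMT = "x * 1000 + y"

# Each setup string defines x and y for one case.
CASES = {
    "ascii + ascii": "x = 'abc'; y = 'z'",
    "ascii + BMP": "x = 'abc'; y = '\\u0fce'",
    "bytes + bytes": "x = b'abc'; y = b'z'",
}


def run(number=10000, repeat=3):
    """Return the best time in seconds for each case."""
    return {
        name: min(timeit.repeat(STMT, setup=setup, number=number, repeat=repeat))
        for name, setup in CASES.items()
    }


if __name__ == "__main__":
    for name, seconds in run().items():
        print("%-14s %.4f s" % (name, seconds))
```

Taking the minimum of several repeats is the usual way to reduce noise from other processes; a difference between the first two cases only says that widening a string costs something, not that the result is wrong.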



-- 
There isn't any problem

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Tim Chase

On 2014-06-06 09:59, Travis Griggs wrote:
> On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> > If you use UTF-8 for everything
> 
> It seems to me, that increasingly other libraries (C, etc), use
> utf8 as the preferred string interchange format.

I definitely advocate UTF-8 for any streaming scenario, as you're
iterating unidirectionally over the data anyway, so why use/transmit
more bytes than needed?  The only failing of UTF-8 that I've found in
the real world(*) is when you have the requirement of constant-time
indexing into strings.
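
A minimal sketch of the streaming case, using the standard library's incremental decoder so that a multi-byte character split across two chunks is still decoded correctly. The function name `decode_stream` and the three-byte chunk size are made up for illustration.

```python
import codecs


def decode_stream(chunks):
    """Yield text from an iterable of UTF-8 byte chunks, in order."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "na\xefve \U0001F600".encode("utf-8")
# Deliberately tiny chunks, so some characters straddle a chunk boundary.
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
print("".join(decode_stream(chunks)) == data.decode("utf-8"))
```

Nothing here needs to index into the text, which is why the variable width of UTF-8 costs nothing in this scenario.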

-tkc


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Roy Smith

Travis Griggs wrote:

> On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> 
> > If you use UTF-8 for everything
> 
> It seems to me, that increasingly other libraries (C, etc), use utf8 as the 
> preferred string interchange format. It's universal, not prone to endian 
> issues, etc.

One of the important etc factors is, "Since it's the most commonly used, 
it's the one that other people are most likely to have implemented 
correctly".  In the real world, these are important considerations.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Travis Griggs


On Jun 4, 2014, at 4:01 AM, Tim Chase  wrote:

> If you use UTF-8 for everything

It seems to me, that increasingly other libraries (C, etc), use utf8 as the 
preferred string interchange format. It’s universal, not prone to endian 
issues, etc. So one *advantage* you gain from using utf8 internally is that any
time you need to hand a string to an external thing, it’s just ready. An app
that restricts its internal string processing to streaming-based operations but
has to hand strings to external libraries a lot (e.g. cairo) might actually
benefit from using utf8 internally, because a) it’s not doing the linear search
for the odd character address and b) it no longer needs to decode/encode every
time it sends a string to or receives one from an external library.


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-06 Thread Anssi Saari

Chris Angelico  writes:
 
> I don't have an actual use-case for this, as I don't target
> microcontrollers, but I'm curious: What parts of Py3 syntax aren't
> supported?

I meant to say % formatting for strings but that's apparently been added
recently. My previous micropython build was from February.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-05 Thread Chris Angelico

On Thu, Jun 5, 2014 at 11:59 PM, Roy Smith  wrote:
> It turns out, we could have upgraded to a newer version of MySQL, which
> did handle astral characters correctly.  But what we did was discard
> the records containing non-BMP data.  Of course, that's a decision that
> can only be made when you understand the business requirements.  In our
> case, discarding those four records had no impact on our business, so it
> made sense.  For other people, not having the full dataset might have
> been a fatal problem.
>
> This was just one of many MySQL problems we ran into.  Eventually, we
> decided it wasn't worth fighting with what was obviously a brain-dead
> system, and switched databases.

Point to note: It's not just "Avoid MySQL version x.y.z, it's buggy",
but "Make sure you're on a sufficiently new version of MySQL *and then
use these settings*". For instance, the MySQL "utf8"
locale/collation/charset (not sure what it calls it) supports only the
BMP; you have to use "utf8mb4", which is UTF-8 that's allowed to go as
far as four bytes long.
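
A quick way to see why a three-byte limit stops exactly at the BMP. This is only a sketch: the helper name `utf8_len` is made up here, and nothing below talks to MySQL.

```python
def utf8_len(ch):
    """Return how many bytes one code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Last code point of the BMP: still fits in three bytes.
bmp_last = "\uffff"
# First code point beyond the BMP (an "astral" character): needs four.
astral_first = "\U00010000"

print(utf8_len("a"), utf8_len(bmp_last), utf8_len(astral_first))
```

This prints `1 3 4`: any storage format that caps UTF-8 sequences at three bytes can hold every BMP character and no astral ones.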

What were they thinking?

What, were they thinking?

I understand there's now an alias "utf8mb3" for the buggy utf8, with
some theory that some future version of MySQL might make utf8 become
an alias for utf8mb4. But when would you ever actually *demand* this
buggy behaviour? Why not just say "as of this version, utf8 is
identical to utf8mb4, which was a superset thereof", and if anything
changes or breaks, just acknowledge that it used to be buggy?

Use PostgreSQL.

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-05 Thread Roy Smith

Rustom Mody wrote:

> On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
> > Yup.  I wrote a while(*) back about the pain I was having importing some 
> > data into a MySQL(**) database

> Here's my interpretation of that situation; I'd like to hear yours:
> 
> Basic problem was that MySQL handled a strict subset of what the rest
> of the system (Python 2.7?)  could handle.

Yes.  This was not a Python issue.  I was just responding to ChrisA's 
statement:

>>> Binding your program to BMP-only is nearly as dangerous as binding 
>>> it to ASCII-only; potentially worse, because you can run an awful 
>>> lot of artificial tests without remembering to stick in some astral 
>>> characters.

> Of course switching to postgres may be a sound choice on other fronts.
> But if that were not an option, and you only had these choices:
> 
> - significantly complexify your MySQL data structures to handle 4 in
>   20 million cases
> - just detect and throw such cases out at the outset
> 
> which would you take?

It turns out, we could have upgraded to a newer version of MySQL, which
did handle astral characters correctly.  But what we did was discard
the records containing non-BMP data.  Of course, that's a decision that
can only be made when you understand the business requirements.  In our 
case, discarding those four records had no impact on our business, so it 
made sense.  For other people, not having the full dataset might have 
been a fatal problem.

This was just one of many MySQL problems we ran into.  Eventually, we 
decided it wasn't worth fighting with what was obviously a brain-dead 
system, and switched databases.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Rustom Mody

On Thursday, June 5, 2014 12:12:06 AM UTC+5:30, Roy Smith wrote:
>  Chris Angelico  wrote:

> > You can't ignore those. You might be able to say "Well, my program
> > will run slower if you throw these at it", but if you're going down
> > that route, you probably want the full FSR and the advantages it
> > confers on ASCII and Latin-1 strings. Binding your program to BMP-only
> > is nearly as dangerous as binding it to ASCII-only; potentially worse,
> > because you can run an awful lot of artificial tests without
> > remembering to stick in some astral characters.

> Yup.  I wrote a while(*) back about the pain I was having importing some 
> data into a MySQL(**) database which (unknown to me when I started) only 
> handled BMP.  It turns out in the entire dataset of 20-odd million 
> records, there were exactly four that had astral characters.  All of my 
> tests worked.  I didn't discover the problem until it blew up many hours 
> into the "final" production import run.

> (*) Two years?

> (**) This was not the only pain point with MySQL.  We eventually
> switched to PostgreSQL.

Thanks Roy for bringing up that example - I was trying to recollect
the details.  I forgot about the MySQL angle which adds a different
twist to it.

Here's my interpretation of that situation; I'd like to hear yours:

Basic problem was that MySQL handled a strict subset of what the rest
of the system (Python 2.7?)  could handle.  This meant that at a late
(and embarrassing) stage, exceptions were being thrown, from deep
within the system.

OTOH, let's say you could detect the 'error' (more correctly
'un-handle-able') at the borders of your system, say when the user
enters the data on a web-form. Would you have a problem kicking out
those characters (in both senses!) with a curt:

"Can't deal with all this supra-galactic rubble!" ?

Of course switching to postgres may be a sound choice on other fronts.
But if that were not an option, and you only had these choices:

- significantly complexify your MySQL data structures to handle 4 in
  20 million cases
- just detect and throw such cases out at the outset

which would you take?
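
The second option, detecting such cases at the outset, could look roughly like this. It is only a sketch: the function names are hypothetical, and it assumes Python 3.3 or later, where every astral character is a single code point rather than a surrogate pair.

```python
def astral_chars(text):
    """Return the code points in text that lie outside the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def check_bmp_only(text):
    """Return text unchanged, or raise ValueError if it is not BMP-only."""
    bad = astral_chars(text)
    if bad:
        names = ", ".join("U+%04X" % ord(ch) for ch in bad)
        raise ValueError("non-BMP characters: " + names)
    return text


print(check_bmp_only("plain BMP text"))
try:
    check_bmp_only("smile \U0001F600")
except ValueError as exc:
    print(exc)
```

Running the check where data enters the system turns a failure many hours into an import into an immediate, explainable rejection.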

In any case, this is the choice I hear from the micropython folks,
who are explicitly seeking a cut-down version of Python.


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Roy Smith

Chris Angelico wrote:

> You can't ignore those. You might be able to say "Well, my program
> will run slower if you throw these at it", but if you're going down
> that route, you probably want the full FSR and the advantages it
> confers on ASCII and Latin-1 strings. Binding your program to BMP-only
> is nearly as dangerous as binding it to ASCII-only; potentially worse,
> because you can run an awful lot of artificial tests without
> remembering to stick in some astral characters.

Yup.  I wrote a while(*) back about the pain I was having importing some 
data into a MySQL(**) database which (unknown to me when I started) only 
handled BMP.  It turns out in the entire dataset of 20-odd million 
records, there were exactly four that had astral characters.  All of my 
tests worked.  I didn't discover the problem until it blew up many hours 
into the "final" production import run.

(*) Two years?

(**) This was not the only pain point with MySQL.  We eventually
switched to PostgreSQL.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Paul Rubin

Tim Chase  writes:
> As mentioned elsewhere, I've got a LOT of code that expects that
> string indexing is O(1), and rarely are those strings/offsets reused.
> I'm streaming through customer/provider data files, so caching
> wouldn't do much good other than waste space and the time to maintain
> them.

I'm having trouble understanding -- if they're only used once then
what's the problem?  You're reading some enormous file into a string and
then randomly accessing it by character offset?  What size are these
strings?  I can think of a number of workarounds including language
extensions, but mostly I'd be interested in seeing some actual
benchmarks of your unmodified program under both representations.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Steven D'Aprano

On Wed, 04 Jun 2014 12:53:19 +0100, Robin Becker wrote:

> I believe that we should distinguish between glyph/character indexing
> and string indexing. Even in unicode it may be hard to decide where a
> visual glyph starts and ends. I assume most people would like to assign
> one glyph to one unicode, but that's not always possible with composed
> glyphs.
> 
> >>> for a in (u'\xc5',u'A\u030a'):
> ...     for o in (u'\xf6',u'o\u0308'):
> ...         u=a+u'ngstr'+o+u'm'
> ...         print("%s %s" % (repr(u),u))
> ...
> u'\xc5ngstr\xf6m' Ångström
> u'\xc5ngstro\u0308m' Ångström
> u'A\u030angstr\xf6m' Ångström
> u'A\u030angstro\u0308m' Ångström
> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False
> 
> so even unicode doesn't always allow for O(1) glyph indexing.

What you're talking about here is "graphemes", not glyphs. Glyphs are the 
little pictures that represent the characters when written down. 
Graphemes (technically, "grapheme clusters") are the things which native 
speakers of a language believe ought to be considered a single unit. 
Think of them as similar to letters. That can be quite tricky to 
determine, and is dependent on the language you are speaking. The letters 
"ch" are considered two letters in English, but only a single letter in 
Czech and Slovak.

I believe that *grapheme-aware* text processing is *far* too complicated 
for a programming language to promise. If you think that len() needs to 
count graphemes, then what should len("ch") return, 1 or 2? Grapheme 
processing is a complex, complicated task best left up to powerful 
libraries built on top of a sturdy Unicode base.

> I know this is artificial, 

But it isn't artificial in the least. Unicode isn't complicated because 
it's badly designed, or complicated for the sake of complexity. It's 
complicated because human language is complicated. That, and because of 
legacy encodings.

> but this is the same situation as utf8 faces just
> the frequency of occurrence is different. A very large amount of
> computing is still western centric so searching a byte string for latin
> characters is still efficient; searching for an n with a tilde on top
> might not be so easy.

This is a good point, but on balance I disagree. A grapheme-aware library 
is likely to need to be based on more complex data structures than simple 
strings (arrays of code points). But for the underlying relatively simple 
string library, graphemes are too hard. Code points are simple, and the 
language can deal with code points without caring about their semantics. 
For instance, in English, I might not want to insert letters between the 
q and u of "queen", since in English u (nearly) always follows q. It 
would be inappropriate for the programming language string library to 
care about that, and similarly it would be inappropriate for it to care 
that u'A\u030a' represents a single grapheme Å.
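
To make the code point view concrete, here is a small sketch using only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

composed = "\xc5ngstr\xf6m"          # precomposed Å and ö
decomposed = "A\u030angstro\u0308m"  # base letters plus combining marks

# len() and == work on code points, so the two spellings differ.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalisation is an explicit, opt-in library call.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

This prints `8 10`, `False` and `True`: the string type stays simple, and code that cares about canonical equivalence asks a library for it.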

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Robin Becker


On 04/06/2014 13:17, Marko Rauhamaa wrote:
> Note, for example, that Google manages to sort out issues like these. It
> sees past diacritics and even case ending.

I guess they must normalize all inputs to some standard form and then search / 
eigenvectorize on those. There are quite a few diacritics and a fair few glyphs 
they could be applied to. I don't think it likely they could map all possible 
combinations to a private range.
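
As a rough illustration of "normalize to some standard form", here is a sketch of a search key that ignores diacritics and case. It says nothing about what Google actually does; the function name `search_key` is made up, and `str.casefold` needs Python 3.3 or later.

```python
import unicodedata


def search_key(text):
    """Return a diacritic-free, case-folded key for loose matching."""
    # NFKD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(search_key("\xc5ngstr\xf6m") == search_key("A\u030aNGSTRO\u0308M"))
print(search_key("R\xe9sum\xe9"))
```

Both spellings of Ångström collapse to the same key, so no private mapping of every letter and diacritic combination is needed. Whether dropping diacritics is acceptable depends on the language, so this is a search aid and not a general equality test.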

--
Robin Becker


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Tim Chase

On 2014-06-04 14:57, Marko Rauhamaa wrote:
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is
> > no longer an O(1) operation, but an O(N) operation.  
> 
> Most string operations are O(N) anyway. Besides, you could try and
> be smart and keep a recent index cached so simple for loops would
> be O(N) instead of O(N**2). So the idea of keeping strings
> internally in UTF-8 might not be all that bad.

As mentioned elsewhere, I've got a LOT of code that expects that
string indexing is O(1), and rarely are those strings/offsets reused.
I'm streaming through customer/provider data files, so caching
wouldn't do much good other than waste space and the time to maintain
them.

If I knew that string indexing was O(something non constant), I'd
have retooled my algorithms to take that into consideration, but that
would be a lot of code I'd need to touch.

-tkc


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Tim Chase

On 2014-06-04 12:53, Robin Becker wrote:
> > If you use UTF-8 for everything, then you end up in a world where
> > string-indexing (see ChrisA's other side thread on this topic) is
> > no longer an O(1) operation, but an O(N) operation.  Some of us
> > slice strings for a living. ;-)
> 
> I believe that we should distinguish between glyph/character
> indexing and string indexing. 

I'm only talking about string indexing using my_string[some_slice]
which is traditionally O(1) and breaking that [cw]ould cause
unexpected performance degradation.

-tkc



Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Marko Rauhamaa

Robin Becker :

> >>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
> False

Now *that* would be a valid reason for our resident Unicode expert to
complain! Py3 in no way solves text representation issues definitively.

> I know this is artificial

Not at all. It probably is out of scope for Python, but it is a real
cause for human suffering. What's Unicode for "résumé"?

Note, for example, that Google manages to sort out issues like these. It
sees past diacritics and even case ending.


Marko

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Marko Rauhamaa

Tim Chase :

> On 2014-06-04 00:58, Paul Rubin wrote:
>> I've never understood why not use UTF-8 for everything.
>
> If you use UTF-8 for everything, then you end up in a world where
> string-indexing (see ChrisA's other side thread on this topic) is no
> longer an O(1) operation, but an O(N) operation.

Most string operations are O(N) anyway. Besides, you could try and be
smart and keep a recent index cached so simple for loops would be O(N)
instead of O(N**2). So the idea of keeping strings internally in UTF-8
might not be all that bad.
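
A sketch of what such a cached index might look like. The class `Utf8Str` is hypothetical and is not how CPython or Micro Python store strings; it handles only non-negative integer indexes and assumes the bytes are valid UTF-8.

```python
class Utf8Str:
    """Hypothetical UTF-8-backed string with a one-entry index cache."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._char = 0   # code point index of the cached position
        self._byte = 0   # byte offset where that code point starts

    @staticmethod
    def _width(lead):
        """Number of bytes in the sequence that starts with this lead byte."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes not supported in this sketch")
        # Restart from the beginning only when moving backwards.
        if index < self._char:
            self._char, self._byte = 0, 0
        data = self._data
        while self._char < index:
            if self._byte >= len(data):
                raise IndexError(index)
            self._byte += self._width(data[self._byte])
            self._char += 1
        if self._byte >= len(data):
            raise IndexError(index)
        end = self._byte + self._width(data[self._byte])
        return data[self._byte:end].decode("utf-8")


s = Utf8Str("a\xe9\U0001F600z")
print([s[i] for i in range(4)])
```

A forward loop over `s[0]`, `s[1]`, ... advances the cached position one step per access, so the whole loop is O(N). Random or backwards access falls back to a linear scan, which is the case the O(1) argument is about.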

Marko

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Robin Becker

On 04/06/2014 12:01, Tim Chase wrote:
> On 2014-06-04 00:58, Paul Rubin wrote:
>> Steven D'Aprano writes:
>>>> Maybe there's a use-case for a microcontroller that works in
>>>> ISO-8859-5 natively, thus using only eight bits per character,
>>> That won't even make the Russians happy, since in Russia there
>>> are multiple incompatible legacy encodings.
>>
>> I've never understood why not use UTF-8 for everything.
>
> If you use UTF-8 for everything, then you end up in a world where
> string-indexing (see ChrisA's other side thread on this topic) is no
> longer an O(1) operation, but an O(N) operation.  Some of us slice
> strings for a living. ;-)  I understand that using UTF-32 would allow
> us to maintain O(1) indexing at the cost of every string occupying 4
> bytes per character.  The FSR (again, as I understand it) allows
> strings that fit in one-byte-per-character to use that, scaling up to
> use wider characters internally as they're actually needed/used.

I believe that we should distinguish between glyph/character indexing and string 
indexing. Even in unicode it may be hard to decide where a visual glyph starts 
and ends. I assume most people would like to assign one glyph to one unicode, 
but that's not always possible with composed glyphs.

>>> for a in (u'\xc5',u'A\u030a'):
...     for o in (u'\xf6',u'o\u0308'):
...         u=a+u'ngstr'+o+u'm'
...         print("%s %s" % (repr(u),u))
...
u'\xc5ngstr\xf6m' Ångström
u'\xc5ngstro\u0308m' Ångström
u'A\u030angstr\xf6m' Ångström
u'A\u030angstro\u0308m' Ångström
>>> u'\xc5ngstr\xf6m'==u'\xc5ngstro\u0308m'
False

so even unicode doesn't always allow for O(1) glyph indexing. I know this is 
artificial, but this is the same situation as utf8 faces just the frequency of 
occurrence is different. A very large amount of computing is still western 
centric so searching a byte string for latin characters is still efficient; 
searching for an n with a tilde on top might not be so easy.

--
Robin Becker


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Tim Chase

On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano  writes:
> >> Maybe there's a use-case for a microcontroller that works in
> >> ISO-8859-5 natively, thus using only eight bits per character, 
> > That won't even make the Russians happy, since in Russia there
> > are multiple incompatible legacy encodings.
> 
> I've never understood why not use UTF-8 for everything.

If you use UTF-8 for everything, then you end up in a world where
string-indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation, but an O(N) operation.  Some of us slice
strings for a living. ;-)  I understand that using UTF-32 would allow
us to maintain O(1) indexing at the cost of every string occupying 4
bytes per character.  The FSR (again, as I understand it) allows
strings that fit in one-byte-per-character to use that, scaling up to
use wider characters internally as they're actually needed/used.
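
The widening can be observed from Python itself. The sketch below assumes CPython 3.3 or later (the FSR is an implementation detail, so other implementations may report different numbers); the helper name `bytes_per_char` is made up for illustration.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per code point by comparing two string sizes."""
    # Subtracting the two sizes cancels out the fixed object header.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


for ch in ("a", "\xe9", "\u0fce", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

On CPython 3.3+ this is expected to report 1, 1, 2 and 4 bytes per character respectively: the widest code point in a string decides the width of every slot, which is what keeps indexing O(1).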

At the cost of complexity and non-constant memory space, an O(N)
algorithm could be tweaked down to O(log N) by using an internal
balanced tree of offsets-to-chunks (where the chunk-size was the size
of a block where it was faster to scan linearly than to navigate the
tree).  One might even endow the algorithm with FSR smarts, so each
chunk/fragment could be a different encoding in memory, and linearly
iterating over the string would walk the tree, returning each decoded
piece. 
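
A simplified sketch of the offsets-to-chunks idea. To keep it short it swaps the balanced tree for a flat table of checkpoints taken every `stride` code points, which gives a bounded scan per lookup rather than O(log N) and only suits immutable strings. The class name and the default stride are made up, and per-chunk encodings are left out.

```python
class CheckpointedUtf8:
    """Hypothetical UTF-8 string with a byte offset stored every `stride` chars."""

    def __init__(self, text, stride=64):
        self._data = text.encode("utf-8")
        self._stride = stride
        self._offsets = []   # byte offset of code points 0, stride, 2*stride, ...
        count = 0
        for pos, byte in enumerate(self._data):
            if byte & 0xC0 != 0x80:      # not a continuation byte: a code point starts here
                if count % stride == 0:
                    self._offsets.append(pos)
                count += 1
        self._length = count

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError(index)
        # Jump to the nearest checkpoint at or before the target.
        pos = self._offsets[index // self._stride]
        # Then scan forward over at most stride - 1 code points.
        for _ in range(index % self._stride):
            pos += 1
            while self._data[pos] & 0xC0 == 0x80:
                pos += 1
        end = pos + 1
        while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
            end += 1
        return self._data[pos:end].decode("utf-8")


text = "a\xe9\U0001F600z" * 50
chunked = CheckpointedUtf8(text, stride=8)
print(len(chunked), chunked[150] == text[150])
```

The stride plays the role of the chunk size in the paragraph above: small enough that a linear scan inside one chunk is cheap, large enough that the offset table stays small.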

-tkc


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Robin Becker


On 04/06/2014 08:58, Paul Rubin wrote:
> Steven D'Aprano writes:
>>> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
>>> natively, thus using only eight bits per character,
>> That won't even make the Russians happy, since in Russia there are
>> multiple incompatible legacy encodings.
>
> I've never understood why not use UTF-8 for everything.


me too

-mojibaked-ly yrs-
Robin Becker


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Wolfgang Maier


On 04.06.2014 09:16, Chris Angelico wrote:
> The point is
> not that you might be able to get away with sticking your head in the
> sand and wishing Unicode would just go away. Even if you can, it's not
> something Python 3 can ever do.



Exactly. These endless discussions about different encodings start to 
get really boring. I cannot think of any aspect of it that hasn't been 
discussed here on several occasions, but as a fact:


"Strings are immutable sequences of Unicode code points" in Python3 
(https://docs.python.org/3/library/stdtypes.html?highlight=str#textseq) 
and this is not an implementation detail. So if any "implementation" 
doesn't stick to this convention, it is simply incomplete.



And I don't think anybody can, anyway. If your device is big enough to
hold Python, it should be big enough to handle Unicode; and then you
don't have to say "Oh, sorry rest-of-the-world, this only works in
English... and only a subset of English... and stuff".



Wolfgang

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Paul Rubin

Steven D'Aprano  writes:
>> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
>> natively, thus using only eight bits per character, 
> That won't even make the Russians happy, since in Russia there are 
> multiple incompatible legacy encodings.

I've never understood why not use UTF-8 for everything.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Steven D'Aprano

On Wed, 04 Jun 2014 17:16:13 +1000, Chris Angelico wrote:

> On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody 
> wrote:
>> On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
>>> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
>>> > And so a pure BMP-supporting implementation may be a reasonable
>>> > compromise. [As long as no surrogate-pairs are there]
>>
>>> Not if you're working on the internet. There are several critical
>>> groups of characters that aren't in the BMP, such as:
>>
>> Of course. But what has the internet to do with micropython?

When I download a script from the Internet to run on my microcontroller, 
written by somebody in Greece, and it calls print on a Greek string, I 
should see Greek text even if I'm in Sweden or New Zealand or Japan.

A fully localised application would be better, of course, but failing 
that I shouldn't see moji-bake.
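
For readers who have not met mojibake first hand, a small sketch of how it arises: the bytes are fine, but the receiving side decodes them with the wrong 8-bit assumption. The sample Greek string is just an illustration.

```python
greek = "Καλημέρα κόσμε"           # what the script author typed
wire = greek.encode("utf-8")        # what actually travels or sits on disk

print(wire.decode("utf-8"))         # decoded as intended: Greek text
print(wire.decode("latin-1"))       # decoded as "8-bit characters": mojibake

# Latin-1 maps every byte to some character, so the wrong decode never fails;
# it silently produces one garbage character per byte instead.
print(len(greek), len(wire), len(wire.decode("latin-1")))
```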


> Earlier you said:
> 
>> IOW from pov of a universally acceptable character set this is mostly
>> rubbish
> 
> "Universally acceptable character set" and microcontrollers may well not
> meet, but if you're talking about universality, you need Unicode. It's
> that simple.

 
> Maybe there's a use-case for a microcontroller that works in ISO-8859-5
> natively, thus using only eight bits per character, 

That won't even make the Russians happy, since in Russia there are 
multiple incompatible legacy encodings.



-- 
Steven

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 3:02 PM, Ian Kelly  wrote:
> On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody  wrote:
>>> 1) Most or all Chinese and Japanese characters
>>
>> Don't know how you count 'most'
>>
>> | One possible rationale is the desire to limit the size of the full
>> | Unicode character set, where CJK characters as represented by discrete
>> | ideograms may approach or exceed 100,000 (while those required for
>> | ordinary literacy in any language are probably under 3,000). Version 1
>> | of Unicode was designed to fit into 16 bits and only 20,940 characters
>> | (32%) out of the possible 65,536 were reserved for these CJK Unified
>> | Ideographs. Later Unicode has been extended to 21 bits allowing many
>> | more CJK characters (75,960 are assigned, with room for more).
>>
>> | From http://en.wikipedia.org/wiki/Han_unification
>
> So there are 20,940 CJK characters in the BMP, and approximately
> 55,000 more in the SIP.  I'd count 55,000 out of 75,960 as "most".

And I said "or all" because I have this vague notion that either NFC
or NFD pushes stuff out of the BMP, although I may be wrong on that.
But certainly 55K/75K "with room for more" is the "most" that I was
talking about. (Maybe it isn't "most" by usage. After all, hypertext
documents are usually smaller in UTF-8 than in UTF-16, despite "most
characters" (counting purely by 21-bit space in codepoints) being more
compact in UTF-16; most by usage is of ASCII, because hypertext
involves a lot of punctuation and such. But still, there are a lot of
CJK that aren't in the BMP.)
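
The size comparison in that parenthesis is easy to check. A small sketch, with made-up sample strings, comparing encoded sizes for markup-heavy text, BMP CJK text, and CJK from the supplementary plane:

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "hypertext": '<p class="note">Hello, <a href="/x">world</a>!</p>',
    "cjk_bmp": "漢字仮名交じり文",                       # 8 characters, all in the BMP
    "cjk_sip": "\U00020000\U00020001\U00020002",      # 3 characters from plane 2
}

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name:10} chars={len(text):3} utf-8={utf8:3} utf-16={utf16:3}")
```

ASCII-heavy markup is half the size in UTF-8, BMP CJK is smaller in UTF-16 (2 bytes against 3), and supplementary-plane characters cost 4 bytes in both.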

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 2:40 PM, Rustom Mody  wrote:
> On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
>> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
>> > And so a pure BMP-supporting implementation may be a reasonable
>> > compromise. [As long as no surrogate-pairs are there]
>
>> Not if you're working on the internet. There are several critical
>> groups of characters that aren't in the BMP, such as:
>
> Of course. But what has the internet to do with micropython?

Earlier you said:

> IOW from pov of a universally acceptable character set this is mostly
> rubbish

"Universally acceptable character set" and microcontrollers may well
not meet, but if you're talking about universality, you need Unicode.
It's that simple.

Maybe there's a use-case for a microcontroller that works in
ISO-8859-5 natively, thus using only eight bits per character, but
even if there is, I would expect a Python implementation on it to
expose Unicode codepoints in its strings. (Most of the time you won't
even be aware of the exact codepoint values. It's only when you put
\xNN or \uNNNN or \UNNNNNNNN escapes into your strings, or explicitly
use ord/chr or equivalent, that it'd make a difference.) The point is
not that you might be able to get away with sticking your head in the
sand and wishing Unicode would just go away. Even if you can, it's not
something Python 3 can ever do.

And I don't think anybody can, anyway. If your device is big enough to
hold Python, it should be big enough to handle Unicode; and then you
don't have to say "Oh, sorry rest-of-the-world, this only works in
English... and only a subset of English... and stuff".

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 5:00 PM, Terry Reedy  wrote:
> On 6/4/2014 1:55 AM, Ian Kelly wrote:
>>
>>
>> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" wrote:
>>  > For technical reasons which I don't fully understand, Unicode only
>>  > uses 21 of those 32 bits, giving a total of 1114112 available code
>>  > points.
>>
>> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
>> sufficient to encode up to 16 supplementary planes, so if Unicode were
>> allowed to grow any larger than that, UTF-16 would no longer be able to
>> encode all codepoints.
>
>
> I believe the original utf-8 used up to 6 bytes per char to encode 2**32
> potential chars. Just 4 bytes limits to 2**21 and for whatever reason
> (easier decoding?), utf-8 was revised down (unusual ;-).

I understood it to be UTF-16's fault, per Ian's statement. That is to
say, the entire Unicode standard was warped around the problem that
some people were going around thinking "a character is 16 bits", even
though that's just as fallacious as "a character is 8 bits".

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-04 Thread Terry Reedy

On 6/4/2014 1:55 AM, Ian Kelly wrote:
> On Jun 3, 2014 11:27 PM, "Steven D'Aprano" wrote:
>> For technical reasons which I don't fully understand, Unicode only
>> uses 21 of those 32 bits, giving a total of 1114112 available code
>> points.
>
> I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
> sufficient to encode up to 16 supplementary planes, so if Unicode were
> allowed to grow any larger than that, UTF-16 would no longer be able to
> encode all codepoints.

I believe the original utf-8 used up to 6 bytes per char to encode 2**32 
potential chars. Just 4 bytes limits to 2**21 and for whatever reason 
(easier decoding?), utf-8 was revised down (unusual ;-).

--
Terry Jan Reedy


Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Ian Kelly

On Jun 3, 2014 11:27 PM, "Steven D'Aprano"  wrote:
> For technical reasons which I don't fully understand, Unicode only
> uses 21 of those 32 bits, giving a total of 1114112 available code
> points.

I think mainly it's to accommodate UTF-16. The surrogate pair scheme is
sufficient to encode up to 16 supplementary planes, so if Unicode were
allowed to grow any larger than that, UTF-16 would no longer be able to
encode all codepoints.

Another benefit of fixing the size is that it frees the other 11 bits per
character of UTF-32 for packing in ancillary data.
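
The arithmetic behind the 16-supplementary-plane limit can be sketched in a few lines. The function names are made up for illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000                      # fits in 20 bits
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Inverse of to_surrogates."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 1024 high surrogates x 1024 low surrogates = 2**20 code points = 16 planes,
# plus the BMP itself: 17 planes of 65536 code points each.
print(17 * 0x10000, 0x10FFFF + 1)

high, low = to_surrogates(0x1F601)
print(hex(high), hex(low), hex(from_surrogates(high, low)))
```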

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Rustom Mody

On Wednesday, June 4, 2014 10:50:21 AM UTC+5:30, Steven D'Aprano wrote:
> On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:
> > And so a pure BMP-supporting implementation may be a reasonable
> > compromise. [As long as no surrogate-pairs are there]

> At the cost of one extra bit, strings could use UTF-16 internally and 
> still have correct behaviour. The bit could be a flag recording whether 
> the string contains any surrogate pairs. If the flag was 0, all string 
> operations could assume a constant 2-bytes-per-character. If the flag was 
> 1, it could fall back to walking the string checking for surrogate pairs.

Yes.  That could be one possibility.  My main reason for giving the
4-engine choice was not that 4 engines are a good idea, but that in the
very differently constrained world of μ-controllers, playing around with
alternate binding times may be advantageous.


> > On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> >> With that in mind, I, as many others, think that forcing Unicode bloat
> >> upon people by default is the most controversial feature of Python3.
> >> The reason is that you can go a very long way dealing with languages of the
> >> people of the world by just treating strings as consisting of 8-bit
> >> data. I'd say, that's enough for 90% of applications. Unicode is needed
> >> only if one needs to deal with multiple languages *at the same time*,
> >> which is fairly rare (remaining 10% of apps).
> >> And please keep in mind that MicroPython was originally intended for (and
> >> should remain scalable down to) an MCU. Unicode is needed there even
> >> less, and there are even fewer resources to support Unicode just because.
> > At some time (when jmf was making more intelligible noises) I had
> > suggested that the choice between 1/2/4 byte strings that happens at
> > runtime in python3's FSR can be made at python-start time with a
> > command-line switch.  There are many combinations here; here is one in
> > more detail:
> > Instead of having one (FSR) string engine, you have (up to) 4
> > - a pure 1 byte (ASCII)

> There are only 128 ASCII characters, so a pure ASCII implementation 
> cannot even represent arbitrary bytes.

Yes this is a subtle point.
I was initially going to write Latin-1. Wrote a rough-n-ready ASCII.
But maybe it could be a choice.

I really don't understand the binding-times of μ-controllers.

My impression is that actual development is split between
1. tinkering with the board
2. working on full-powered computers and downloading to the board

In going from 2 to 1, heavy amounts of cut-downs are probably possible and
desirable. If this is the case, having hooks in the system for making
these choices may be a good idea, and making them optimally may be worthwhile.

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Steven D'Aprano

On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:

> On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
> 
>> With that in mind, I, as many others, think that forcing Unicode bloat
>> upon people by default is the most controversial feature of Python3.
>> The reason is that you can go a very long way dealing with languages of the
>> people of the world by just treating strings as consisting of 8-bit
>> data. I'd say, that's enough for 90% of applications. Unicode is needed
>> only if one needs to deal with multiple languages *at the same time*,
>> which is fairly rare (remaining 10% of apps).
> 
>> And please keep in mind that MicroPython was originally intended for (and
>> should remain scalable down to) an MCU. Unicode is needed there even
>> less, and there are even fewer resources to support Unicode just because.
> 
> At some time (when jmf was making more intelligible noises) I had
> suggested that the choice between 1/2/4 byte strings that happens at
> runtime in python3's FSR can be made at python-start time with a
> command-line switch.  There are many combinations here; here is one in
> more detail:
> 
> Instead of having one (FSR) string engine, you have (up to) 4
> 
> - a pure 1 byte (ASCII)

There are only 128 ASCII characters, so a pure ASCII implementation 
cannot even represent arbitrary bytes.

> - a pure 2 byte (BMP) with decode-failures for out-of-ranges

That's not Unicode. It's a subset of Unicode.

> - a pure 4 byte -- everything UTF-32

For embedded devices, that would be extremely memory hungry. Remember, 
every variable, every attribute name, every method and class and function 
name is a string. Using at least 56 bytes just to refer to 
sys.stdout.write will be painful.
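
The 56-byte figure can be reproduced by counting only the character payload of the three names involved (3 + 6 + 5 = 14 characters at 4 bytes each). This sketch ignores object headers, which would come on top, and the helper name is made up for illustration.

```python
def payload_bytes(names, encoding: str) -> int:
    """Total bytes needed for the characters of the given identifiers."""
    return sum(len(name.encode(encoding)) for name in names)


names = ["sys", "stdout", "write"]
print(payload_bytes(names, "utf-32-le"))   # fixed 4 bytes per character
print(payload_bytes(names, "utf-16-le"))   # 2 bytes per character here
print(payload_bytes(names, "latin-1"))     # 1 byte per character
```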

> - FSR dynamic switching at runtime (with massive moping from the world's
> jmfs)

Please stop giving JMF's crackpot opinion even the dignity of being 
sneered at.

[...]
> 2. My casual/cursory reading of the contents of the SMP-planes suggests
> that the stuff there is things like - egyptian hieroglyphics
> - mahjong characters
> - ancient greek musical symbols
> - alchemical symbols etc etc.
> 
> IOW from pov of a universally acceptable character set this is mostly
> rubbish

Certainly some of these things are more whimsical than practical, but it 
doesn't really matter. Even if you strip out every bit of whimsy from the 
Unicode character set, you're still left with needing more than 65536 
characters (16 bits). For efficiency you aren't going to use 17 bits, or 
18, or 19, so it's actually faster and more efficient to jump right to 32 
bits. For technical reasons which I don't fully understand, Unicode only 
uses 21 of those 32 bits, giving a total of 1114112 available code 
points. Whether you or I personally have need for alchemical symbols, 
*some people* do, and supporting their use-case doesn't harm us by one 
bit.

> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]

At the cost of one extra bit, strings could use UTF-16 internally and 
still have correct behaviour. The bit could be a flag recording whether 
the string contains any surrogate pairs. If the flag was 0, all string 
operations could assume a constant 2-bytes-per-character. If the flag was 
1, it could fall back to walking the string checking for surrogate pairs.
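
A minimal sketch of that flag idea, assuming the input contains no lone surrogates. The class name Utf16Str and the attribute has_pairs are hypothetical; a real implementation would store the code units in a compact array rather than a Python list.

```python
class Utf16Str:
    """Hypothetical string kept as UTF-16 code units plus a one-bit 'has pairs' flag."""

    def __init__(self, text: str):
        units = []
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:                       # needs a surrogate pair
                offset = cp - 0x10000
                units.append(0xD800 + (offset >> 10))
                units.append(0xDC00 + (offset & 0x3FF))
            else:
                units.append(cp)
        self._units = units
        self._length = len(text)
        self.has_pairs = len(units) != len(text)  # the "one extra bit"

    def __len__(self):
        return self._length

    def __getitem__(self, index: int) -> str:
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("string index out of range")
        if not self.has_pairs:
            return chr(self._units[index])        # fast path: O(1)
        pos = 0                                   # slow path: walk the code units
        for _ in range(index):
            pos += 2 if 0xD800 <= self._units[pos] <= 0xDBFF else 1
        unit = self._units[pos]
        if 0xD800 <= unit <= 0xDBFF:
            low = self._units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


plain = Utf16Str("hello")
astral = Utf16Str("a\U0001D11Eb\U0001F601c")
print(plain.has_pairs, plain[1], astral.has_pairs, astral[3] == "\U0001F601")
```

Strings that stay inside the BMP never pay for the scan; only strings that actually contain astral characters fall back to the O(N) walk.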

-- 
Steven

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Ian Kelly

On Tue, Jun 3, 2014 at 10:40 PM, Rustom Mody  wrote:
>> 1) Most or all Chinese and Japanese characters
>
> Don't know how you count 'most'
>
> | One possible rationale is the desire to limit the size of the full
> | Unicode character set, where CJK characters as represented by discrete
> | ideograms may approach or exceed 100,000 (while those required for
> | ordinary literacy in any language are probably under 3,000). Version 1
> | of Unicode was designed to fit into 16 bits and only 20,940 characters
> | (32%) out of the possible 65,536 were reserved for these CJK Unified
> | Ideographs. Later Unicode has been extended to 21 bits allowing many
> | more CJK characters (75,960 are assigned, with room for more).
>
> | From http://en.wikipedia.org/wiki/Han_unification

So there are 20,940 CJK characters in the BMP, and approximately
55,000 more in the SIP.  I'd count 55,000 out of 75,960 as "most".

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Rustom Mody

On Wednesday, June 4, 2014 9:22:54 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody wrote:
> > And so a pure BMP-supporting implementation may be a reasonable
> > compromise. [As long as no surrogate-pairs are there]

> Not if you're working on the internet. There are several critical
> groups of characters that aren't in the BMP, such as:

Of course. But what has the internet to do with micropython?

This is their stated goal:
| Micro Python is a lean and fast implementation of the Python
| programming language (python.org) that is optimised to run on a
| microcontroller.

> 1) Most or all Chinese and Japanese characters

Don't know how you count 'most'

| One possible rationale is the desire to limit the size of the full
| Unicode character set, where CJK characters as represented by discrete
| ideograms may approach or exceed 100,000 (while those required for
| ordinary literacy in any language are probably under 3,000). Version 1
| of Unicode was designed to fit into 16 bits and only 20,940 characters
| (32%) out of the possible 65,536 were reserved for these CJK Unified
| Ideographs. Later Unicode has been extended to 21 bits allowing many
| more CJK characters (75,960 are assigned, with room for more).

| From http://en.wikipedia.org/wiki/Han_unification

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico

On Wed, Jun 4, 2014 at 1:37 PM, Rustom Mody  wrote:
> 2. My casual/cursory reading of the contents of the SMP-planes
> suggests that the stuff there is things like
> - egyptian hieroglyphics
> - mahjong characters
> - ancient greek musical symbols
> - alchemical symbols etc etc.
>
> IOW from pov of a universally acceptable character set this is mostly
> rubbish
>
> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]

Not if you're working on the internet. There are several critical
groups of characters that aren't in the BMP, such as:

1) Most or all Chinese and Japanese characters
2) Heaps of emoticons and fancy letters
3) Mathematical symbols

You can't ignore those. You might be able to say "Well, my program
will run slower if you throw these at it", but if you're going down
that route, you probably want the full FSR and the advantages it
confers on ASCII and Latin-1 strings. Binding your program to BMP-only
is nearly as dangerous as binding it to ASCII-only; potentially worse,
because you can run an awful lot of artificial tests without
remembering to stick in some astral characters.
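
The advantage for ASCII and Latin-1 strings, and the cost of astral characters, can be observed directly. A small sketch, assuming CPython 3.3 or later (where PEP 393 applies); other implementations may report different numbers. The helper name is made up.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by comparing two strings of different length."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"), ("astral", "\U0001F601")]:
    print(f"{label:8} {bytes_per_char(ch)} byte(s) per character")
```

A single astral character anywhere in a string widens the whole string to 4 bytes per character, which is the sort of case worth including in tests.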

It's not rubbish. It's important stuff that you need to deal with.

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Rustom Mody

On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:

> With that in mind, I, as many others, think that forcing Unicode bloat
> upon people by default is the most controversial feature of Python3.
> The reason is that you can go a very long way dealing with languages of the
> people of the world by just treating strings as consisting of 8-bit
> data. I'd say, that's enough for 90% of applications. Unicode is needed
> only if one needs to deal with multiple languages *at the same time*,
> which is fairly rare (remaining 10% of apps).

> And please keep in mind that MicroPython was originally intended for (and
> should remain scalable down to) an MCU. Unicode is needed there even
> less, and there are even fewer resources to support Unicode just because.

At some time (when jmf was making more intelligible noises) I had
suggested that the choice between 1/2/4 byte strings that happens at
runtime in python3's FSR can be made at python-start time with a
command-line switch.  There are many combinations here; here is one in
more detail:

Instead of having one (FSR) string engine, you have (up to) 4

- a pure 1 byte (ASCII)
- a pure 2 byte (BMP) with decode-failures for out-of-ranges
- a pure 4 byte -- everything UTF-32
- FSR dynamic switching at runtime (with massive moping from the world's jmfs)

The point is that only one of these engines would be brought into memory
based on command-line/config options.

Some more personal thoughts (that may be quite ill-informed!):

1. I regard myself as a unicode ignoramus+enthusiast. The world will
be a better place if unicode is more pervasive.
See http://blog.languager.org/2014/04/unicoded-python.html

As it happens I am also a computer scientist -- I understand that in
contexts where anything other than 8-bit chars is unacceptably
inefficient, unicode-bloat may be a real thing.

2. My casual/cursory reading of the contents of the SMP-planes
suggests that the stuff there is things like
- egyptian hieroglyphics
- mahjong characters
- ancient greek musical symbols
- alchemical symbols etc etc.

IOW from pov of a universally acceptable character set this is mostly
rubbish

And so a pure BMP-supporting implementation may be a reasonable
compromise. [As long as no surrogate-pairs are there]

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico

On Wed, Jun 4, 2014 at 7:41 AM, Paul Sokolovsky  wrote:
> Hello,
>
> On Wed, 4 Jun 2014 03:08:57 +1000
> Chris Angelico  wrote:
>
> []
>
>> With that encouragement, I just cloned your repo and built it on amd64
>> Debian Wheezy. Works just fine! Except... I've just found one fairly
>> major problem with your support of Python 3.x syntax. Your str type is
>> documented as not supporting Unicode. Is that a current flaw that
>> you're planning to remove, or a design limitation? Either way, I'm a
>> bit dubious about a purported version 1 that doesn't do one of the
>> things that Py3 is especially good at - matched by very few languages
>> in its encouragement of best practice with Unicode support.
>
> I should start by saying that it's MicroPython that made me look at
> Python3. So for me, it has already done a lot of good by getting me out
> from under the rock, so now instead of "at my job, we use python 2.x" I may
> report "at my job, we don't wait until our distro kicks us in the ass, and
> add 'from __future__ import print_function' whenever we touch some
> code".

And that's a good thing :) Using Python 2.7 and starting to put in the
future directives breaks nothing, and will save you time later.

> With that in mind, I, as many others, think that forcing Unicode bloat
> upon people by default is the most controversial feature of Python3.
> The reason is that you can go a very long way dealing with languages of the
> people of the world by just treating strings as consisting of 8-bit
> data. I'd say, that's enough for 90% of applications. Unicode is needed
> only if one needs to deal with multiple languages *at the same time*,
> which is fairly rare (remaining 10% of apps).

Absolutely not. This is the mentality that results in web applications
that break on "funny characters", which is completely the wrong way to
look at it. The truth is, there are not many funny characters in
Unicode at all; I found these, but that's about it:

http://www.fileformat.info/info/unicode/char/1F601/index.htm
http://www.fileformat.info/info/unicode/char/1F638/index.htm

Your code should accept any valid character with equal correctness.
(Note to jmf: Correctness does not necessarily imply exact nanosecond
performance, just that the right result is reached.) These days,
Unicode *is* needed everywhere. You might think you can get away with
"8-bit data", but is that 8-bit data actually encoded Latin-1 or
UTF-8? There's a vast difference between them, and you'll hit it in
any English text with U+00A9 ©, or U+201C U+201D quotes, or any of a
large number of other common non-ASCII characters. Oh, and the three I
just mentioned happen to be in CP-1252, another common 8-bit encoding,
and a lot of people and programs don't know how to tell CP-1252 from
Latin-1 and label one as the other.
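
A short sketch of why "8-bit data" is ambiguous for exactly these characters. The helper name encodings_that_fit is made up for illustration; the codec names are the standard Python ones.

```python
def encodings_that_fit(text: str, candidates=("ascii", "latin-1", "cp1252", "utf-8")):
    """Map each candidate encoding to the encoded bytes, or None if it cannot encode text."""
    result = {}
    for name in candidates:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


print(encodings_that_fit("\u00a9"))   # copyright sign
print(encodings_that_fit("\u201c"))   # left double quotation mark

# The same byte read under two labels: a curly quote in CP-1252,
# an invisible C1 control character in Latin-1.
print(repr(b"\x93".decode("cp1252")), repr(b"\x93".decode("latin-1")))

# UTF-8 bytes mislabelled as Latin-1 give the familiar two-character garbage.
print(b"\xc2\xa9".decode("latin-1"))
```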

Unicode is needed on anything that touches the internet, which is a
*lot* more than 10% of applications. Unicode is also needed on
anything that shares files with anyone who speaks more than one
language, or uses any symbol that isn't in ASCII, or pretty much
anything beyond plain English with a restricted set of punctuation.
And even if you can guarantee that you're working only with English
and only with ASCII, you still need to be aware that ASCII text is
different "stuff" from a JPEG file, although it's possible to bury
your head in the sand over that one.

> But generally, there's no strict roadmap for MicroPython features.
> While the core of the language (parser, compiler, VM) is developed by
> Damien, many other features have already been contributed by the community
> (the project went open-source at the beginning of the year). So, if someone
> wants to see Unicode support badly enough to provide patches,
> they will gladly be accepted. The only thing we have established is that we
> want to be able to scale down, and thus almost all features should be
> configurable.

And that's exactly what's happening right now.

https://github.com/micropython/micropython/issues/657
https://github.com/Rosuav/micropython

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 03:08:57 +1000
Chris Angelico  wrote:

[]

> With that encouragement, I just cloned your repo and built it on amd64
> Debian Wheezy. Works just fine! Except... I've just found one fairly
> major problem with your support of Python 3.x syntax. Your str type is
> documented as not supporting Unicode. Is that a current flaw that
> you're planning to remove, or a design limitation? Either way, I'm a
> bit dubious about a purported version 1 that doesn't do one of the
> things that Py3 is especially good at - matched by very few languages
> in its encouragement of best practice with Unicode support.

I should start by saying that it's MicroPython that made me look at
Python3. So for me, it has already done a lot of good by getting me out
from under the rock, so now instead of "at my job, we use python 2.x" I may
report "at my job, we don't wait until our distro kicks us in the ass, and
add 'from __future__ import print_function' whenever we touch some
code".

With that in mind, I, as many others, think that forcing Unicode bloat
upon people by default is the most controversial feature of Python3.
The reason is that you can go a very long way dealing with languages of the
people of the world by just treating strings as consisting of 8-bit
data. I'd say, that's enough for 90% of applications. Unicode is needed
only if one needs to deal with multiple languages *at the same time*,
which is fairly rare (remaining 10% of apps).

And please keep in mind that MicroPython was originally intended for (and
should remain scalable down to) an MCU. Unicode is needed there even
less, and there are even fewer resources to support Unicode just because.

> 
> What is your str type actually able to support? It seems to store
> non-ASCII bytes in it, which I presume are supposed to represent the
> rest of Latin-1, but I wasn't able to print them out:

There's a work-in-progress document on the differences between CPython
and MicroPython at
https://github.com/micropython/micropython/wiki/Differences, which gives
the following account of this:

"No unicode support is actually implemented. Python3 calls for strict
difference between str and bytes data types (unlike Python2, which has
neutral unified data type for strings and binary data, and separates
out unicode data type). MicroPython faithfully implements str/bytes
separation, but currently, underlying str implementation is the same as
bytes. This means strings in MicroPython are not unicode, but 8-bit
characters (fully binary-clean)."

> 
> Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
> >>> print("asdf\xfdqwer")
> 
> Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
> [GCC 4.7.2] on linux
> >>> print("asdf\xfdqwer")
> asdfýqwer
> 
> In fact, printing seems to work with bytes:
> 
> >>> print("asdf\xc3\xbdqwer")
> asdfýqwer
> 
> (my terminal uses UTF-8, this is the UTF-8 encoding of the above
> string)
> 
> I would strongly recommend either implementing all of PEP 393, or at
> least making it very clear that this pretends everything is bytes -
> and possibly disallowing any codepoint >127 in any string, which will
> at least mean you're safe on all ASCII-compatible encodings.

MicroPython is not the first "tiny" Python implementation. What sets
MicroPython apart is that being a subset of the language is neither its
aim nor its motto. And yet, it's not a CPython rewrite either. So, while Unicode
support is surely possible, it's unlikely to be done as "all of
PEPxxx". If you ask me, I'd personally envision it to be implemented as
UTF-8 (in this regard I agree with (or take an influence from) 
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/). But I don't have plans
to work on Unicode any time soon - applications I envision for
MicroPython so far fit in those 90% that live happily without Unicode.

But generally, there's no strict roadmap for MicroPython features.
While the core of the language (parser, compiler, VM) is developed by
Damien, many other features have already been contributed by the community
(the project went open-source at the beginning of the year). So, if someone
wants to see Unicode support badly enough to provide patches,
they will gladly be accepted. The only thing we have established is that we
want to be able to scale down, and thus almost all features should be
configurable.

> 
> ChrisA

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico

On Wed, Jun 4, 2014 at 2:49 AM, Paul Sokolovsky  wrote:
> As can be seen from the dump above, MicroPython works perfectly on a
> Linux system, so we encourage any pythonista to touch a little bit of
> Python magic and give it a try! ;-) And we are of course interested in
> feedback on how portable it is, etc.
>

With that encouragement, I just cloned your repo and built it on amd64
Debian Wheezy. Works just fine! Except... I've just found one fairly
major problem with your support of Python 3.x syntax. Your str type is
documented as not supporting Unicode. Is that a current flaw that
you're planning to remove, or a design limitation? Either way, I'm a
bit dubious about a purported version 1 that doesn't do one of the
things that Py3 is especially good at - matched by very few languages
in its encouragement of best practice with Unicode support.

What is your str type actually able to support? It seems to store
non-ASCII bytes in it, which I presume are supposed to represent the
rest of Latin-1, but I wasn't able to print them out:

Micro Python v1.0.1-144-gb294a7e on 2014-06-04; UNIX version
>>> print("asdf\xfdqwer")

Python 3.5.0a0 (default:6a0def54c63d, Mar 26 2014, 01:11:09)
[GCC 4.7.2] on linux
>>> print("asdf\xfdqwer")
asdfýqwer

In fact, printing seems to work with bytes:

>>> print("asdf\xc3\xbdqwer")
asdfýqwer

(my terminal uses UTF-8, this is the UTF-8 encoding of the above string)
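To make the difference concrete, here is a small sketch for CPython 3 (the variable names are mine, purely for illustration). It shows that "\xfd" is a single code point in a real Unicode str, while its UTF-8 encoding is the two bytes C3 BD. A str type that merely stores bytes behaves like the last line, where every byte is treated as a character of its own:

```python
# One code point, U+00FD, embedded in an otherwise ASCII string.
text = "asdf\xfdqwer"

# Encoding to UTF-8 turns that single code point into two bytes.
utf8_bytes = text.encode("utf-8")

# Treating each UTF-8 byte as its own character (Latin-1 style) gives a
# different, longer string: this is what a bytes-only str would hold.
bytes_as_text = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(bytes_as_text))
```

The two strings only look the same on a UTF-8 terminal because the terminal does the decoding, not the interpreter.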

I would strongly recommend either implementing all of PEP 393, or at
least making it very clear that this pretends everything is bytes -
and possibly disallowing any codepoint >127 in any string, which will
at least mean you're safe on all ASCII-compatible encodings.
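As a rough sketch of the second option, a check like the one below could reject any string containing a code point above 127. The helper name first_non_ascii is hypothetical, and this is ordinary Python 3 rather than anything taken from the Micro Python sources:

```python
def first_non_ascii(source_text):
    """Return the index of the first code point above 127, or -1 if none."""
    for index, char in enumerate(source_text):
        if ord(char) > 127:
            return index
    return -1


# A pure-ASCII string passes; the earlier example is flagged at index 4.
print(first_non_ascii("asdfqwer"))
print(first_non_ascii("asdf\xfdqwer"))
```

Anything that passes this check has the same bytes in every ASCII-compatible encoding.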

ChrisA

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Paul Sokolovsky

Hello,

On Tue, 3 Jun 2014 23:11:46 +1000
Chris Angelico  wrote:

> On Tue, Jun 3, 2014 at 10:27 PM, Damien George
>  wrote:
> > - Supports almost full Python 3 syntax, including yield (compiles
> > 99.99% of the Python 3 standard library).
> > - It supports a growing subset of Python 3 types and operations.
> > - Part of the Python 3 standard library has already been ported to
> > Micro Python, and work is ongoing to port as much as feasible.
> 
> I don't have an actual use-case for this, as I don't target
> microcontrollers, 

Please let me chime in, as one of the MicroPython contributors. I also
don't have an immediate use case for a Python microcontroller (but
seeing how fast the industry moves, I won't be surprised if in half a
year it seems just right). Instead, I treat MicroPython as a Python
implementation which scales *down* very well. With the current
situation in the industry, people mostly care about scaling up -
consuming more gigabytes and gigahertz, catching more clouds and
including heavier and heavier batteries.

MicroPython goes in another direction. You don't have to use it on a
microcontroller. It's just that if you want or need to, you'll be able
to - while still staying with your favorite language.

I'm personally interested in using MicroPython on small embedded
Linux systems, like home routers, Internet-of-Things devices, etc. Such
devices usually have just a few hundred megahertz of CPU power and
2-4MB of flash. And to cut costs, the lower bound decreases all the
time.

> but I'm curious: What parts of Py3 syntax aren't
> supported? And since you say "port as much as feasible", presumably
> there'll be parts that are never supported. Are there some syntactic
> elements that just take up way too much memory?

Syntax-wise, all Python 3.3 syntax is supported. This includes things
like yield from, annotations, etc. For example:

$ micropython 
Micro Python v1.0.1-139-g411732e on 2014-06-03; UNIX version
>>> def foo(a:int) -> float:
...     return float(a)
... 
>>> foo(4)
4.0

The "99.99%" statement is due to the fact that there were some problems
parsing a couple of files in the CPython 3.3/3.4 stdlib.

Note that the above is about syntax, not semantics. Though the core
language semantics are actually implemented pretty well now. For
example, "yield from" works pretty well, so asyncio could work ;-).
(Except my analysis showed that CPython's implementation is a bit
bloated for MicroPython's requirements, so I started to write a
simplified implementation from scratch.)
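For anyone who hasn't used the construct, here is a minimal sketch of "yield from" delegation. The generator names are made up for illustration, and this is plain Python 3.3+ code, not anything specific to MicroPython or to asyncio internals:

```python
def inner():
    # Hand a value out, then wait for the caller to send one back in.
    received = yield "ready"
    # The return value becomes the result of "yield from" in the caller.
    return received * 2


def outer():
    # Delegate to inner(): its yields and sends pass straight through.
    result = yield from inner()
    yield result


gen = outer()
first = next(gen)       # runs until inner() yields
second = gen.send(21)   # resumes inner(), which returns; outer() yields
print(first, second)
```

Coroutine-style libraries rely on exactly this pass-through of values and return results, which is why working "yield from" matters for anything asyncio-like.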

As can be seen from the dump above, MicroPython works perfectly on a
Linux system, so we encourage any pythonista to touch a little bit of
Python magic and give it a try! ;-) And we are of course interested in
feedback on how portable it is, etc.

(As a side note, it's of course possible to compile and run MicroPython
on Windows too, though it's a bit more complicated than just "make".)


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Steven D'Aprano

On Tue, 03 Jun 2014 13:27:11 +0100, Damien George wrote:

> Hi,
> 
> We would like to announce Micro Python, an implementation of Python 3
> optimised to have a low memory footprint.

Fantastic!




-- 
Steven D'Aprano
http://import-that.dreamwidth.org/

Re: Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Chris Angelico

On Tue, Jun 3, 2014 at 10:27 PM, Damien George
 wrote:
> - Supports almost full Python 3 syntax, including yield (compiles
> 99.99% of the Python 3 standard library).
> - It supports a growing subset of Python 3 types and operations.
> - Part of the Python 3 standard library has already been ported to
> Micro Python, and work is ongoing to port as much as feasible.

I don't have an actual use-case for this, as I don't target
microcontrollers, but I'm curious: What parts of Py3 syntax aren't
supported? And since you say "port as much as feasible", presumably
there'll be parts that are never supported. Are there some syntactic
elements that just take up way too much memory?

ChrisA

Micro Python -- a lean and efficient implementation of Python 3

2014-06-03 Thread Damien George

Hi,

We would like to announce Micro Python, an implementation of Python 3
optimised to have a low memory footprint.

While Python has many attractive features, current implementations
(read CPython) are not suited for embedded devices, such as
microcontrollers and small systems-on-a-chip.  This is because CPython
uses an awful lot of RAM -- both stack and heap -- even for simple
things such as integer addition.
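As a rough illustration of the heap side of this, the sketch below asks CPython how much memory a single small integer object occupies. The helper name object_size_bytes is hypothetical, the figure varies with the CPython version and platform, and sys.getsizeof says nothing about stack usage:

```python
import sys


def object_size_bytes(value):
    """Return the size CPython reports for one object, in bytes."""
    return sys.getsizeof(value)


# Every int in CPython is a full heap object with a header, so even the
# number 1 costs noticeably more than a single machine word.
small_int_size = object_size_bytes(1)
print(small_int_size)
```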

Micro Python is a new implementation of the Python 3 language, which
aims to be properly compatible with CPython, while sporting a very
minimal RAM footprint, a compact compiler, and a fast and efficient
runtime.  These goals have been met by employing many tricks with
pointers and bit stuffing, and placing as much as possible in
read-only memory.
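The announcement does not spell out which tricks are meant. One widely used form of bit stuffing is to store small integers directly inside a pointer-sized word, using the low bit as a tag, so that no heap object is needed. The sketch below simulates that idea in Python; the function names and the one-bit tag layout are assumptions for illustration, not a description of Micro Python's internals:

```python
TAG_SMALL_INT = 1  # assumed layout: lowest bit set marks an immediate int


def box_small_int(value):
    """Pack an integer into a tagged word: shift left, set the tag bit."""
    return (value << 1) | TAG_SMALL_INT


def is_small_int(word):
    """Aligned pointers have a clear low bit, so the tag tells them apart."""
    return (word & 1) == TAG_SMALL_INT


def unbox_small_int(word):
    """Recover the integer with an arithmetic right shift."""
    return word >> 1


word = box_small_int(-3)
print(is_small_int(word), unbox_small_int(word))
print(is_small_int(0x1000))  # an aligned address is not a small int
```

With a scheme like this, adding two small integers needs no allocation at all, which is the kind of saving the paragraph above is pointing at.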

Micro Python has the following features:

- Supports almost full Python 3 syntax, including yield (compiles
99.99% of the Python 3 standard library).
- Most scripts use significantly less RAM in Micro Python, and various
benchmark programs run faster, compared with CPython.
- A minimal ARM build fits in 80k of program space, and with all
features enabled it fits in around 200k on Linux.
- Micro Python needs only 2k RAM for a basic REPL.
- It has 2 modes of AOT (ahead of time) compilation to native machine
code, doubling execution speed.
- There is an inline assembler for use in time-critical
microcontroller applications.
- It is written in C99 ANSI C and compiles cleanly under Unix (POSIX),
Mac OS X, Windows and certain ARM based microcontrollers.
- It supports a growing subset of Python 3 types and operations.
- Part of the Python 3 standard library has already been ported to
Micro Python, and work is ongoing to port as much as feasible.

More info at:

http://micropython.org/

You can follow the progress and contribute at github:

www.github.com/micropython/micropython
www.github.com/micropython/micropython-lib

--
Damien / Micro Python team.
