Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stephen J. Turnbull
Glenn Linderman writes:

  3) (Most space efficient) One cached entry, that caches the last 
  codepoint/byte position referenced. UTF-8 is able to be traversed in 
  either direction, so next/previous codepoint access would be 
  relatively fast (and such are very common operations, even when indexing 
  notation is used: for ix in range( len( str_x )): func( str_x[ ix ]).)

Been there, tried that (Emacsen).  Either it's a YAGNI (moving forward
or backward over UTF-8 by character for short distances is plenty
fast, especially if you've got a lot of ASCII, where you can move by
words for somewhat longer distances), or it's not good enough.  There
*may* be a sweet spot, but it's definitely smaller than the one on
Sharapova's racket.
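The bidirectional traversal being discussed relies on UTF-8 continuation
bytes being self-identifying (they match the bit pattern 10xxxxxx). A
minimal sketch in plain Python, purely illustrative and not anyone's
actual implementation:

```python
def next_codepoint(buf: bytes, pos: int) -> int:
    """Advance one code point: step past the lead byte, then skip any
    continuation bytes (those shaped 0b10xxxxxx)."""
    pos += 1
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

def prev_codepoint(buf: bytes, pos: int) -> int:
    """Back up one code point: step over continuation bytes until we
    land on a lead byte."""
    pos -= 1
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos
```

With a single cached (codepoint, byte) position, str_x[ix + 1] becomes
one call to next_codepoint rather than a scan from the start of the
string.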

  4) (Fixed size caches)  N entries, one for the last codepoint, and 
  others at Codepoint_Length/N intervals.  N could be tunable.

To achieve a space saving, the cache has to be quite small, and the
bigger your integers, the smaller it gets.  A naive implementation on
a 64-bit machine would give you 16 bytes/cache entry.  Using a
non-native size will be a space win, but needs care in implementation.
Initializing the cache is very expensive for small strings, so you
need conditional and maybe lazy initialization (for large strings).
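A sketch of such a fixed-size, lazily initialized cache (the class and
all names here are illustrative, not an actual implementation):

```python
class Utf8Index:
    """Option 4 sketch: N evenly spaced (codepoint index, byte offset)
    checkpoints, built lazily so short strings never pay the cost."""

    N = 4  # tunable number of checkpoints

    def __init__(self, buf):
        self.buf = buf    # UTF-8 encoded bytes
        self._cps = None  # checkpoints, built on first index operation

    def _checkpoints(self):
        if self._cps is None:
            # Lead bytes are the ones not shaped 0b10xxxxxx.
            length = sum(1 for b in self.buf if (b & 0xC0) != 0x80)
            step = max(1, length // self.N)
            cps, ci = [], 0
            for bo, b in enumerate(self.buf):
                if (b & 0xC0) != 0x80:
                    if ci % step == 0:
                        cps.append((ci, bo))
                    ci += 1
            self._cps = cps
        return self._cps

    def byte_offset(self, index):
        # Start from the nearest checkpoint at or before `index`,
        # then scan forward one code point at a time.
        ci, bo = max(c for c in self._checkpoints() if c[0] <= index)
        while ci < index:
            bo += 1
            while (self.buf[bo] & 0xC0) == 0x80:
                bo += 1
            ci += 1
        return bo
```

A real implementation would use packed non-native integers for the
checkpoints, as noted above; a Python list of tuples is just the
easiest way to show the lookup logic.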

By the way, there's also

10) Keep counts of the leading and trailing number of ASCII
(one-octet) characters.  This is often a *huge* win; it's quite
common to encounter documents where size - lc - tc = 2 (i.e.,
there's only one two-octet character in the document).
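A sketch of option 10 (illustrative only): within the leading and
trailing ASCII runs, the byte offset equals the code point index, so
indexing there is O(1):

```python
def ascii_bounds(buf):
    """Count the leading (lc) and trailing (tc) one-octet characters
    of a UTF-8 string. Inside those runs, byte offset == code point
    index, so indexing is O(1); only the middle needs scanning."""
    lc = 0
    while lc < len(buf) and buf[lc] < 0x80:
        lc += 1
    tc = 0
    while tc < len(buf) - lc and buf[len(buf) - 1 - tc] < 0x80:
        tc += 1
    return lc, tc
```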

11) Keep a list (or tree) of most-recently-accessed positions.

Despite my negative experience with multibyte encodings in Emacsen,
I'm persuaded by the arguments that there probably aren't all that
many places in core Python where indexing is used in an essential way,
so MicroPython itself can probably optimize those behind the
scenes.  Application programmers in the embedded context may be
expected to deal with the need to avoid random access algorithms
and use iterators and generators to accomplish most tasks.




___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

05.06.14 03:03, Greg Ewing написав(ла):

Serhiy Storchaka wrote:

html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't
use iterators. They use indices, str.find and/or regular expressions.
A common use case is to quickly find a substring starting from the
current position using str.find or re.search, process the found token,
advance the position, and repeat.
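The scanning pattern described above can be sketched as follows (a
simplified illustration, not the actual json/tokenize code):

```python
import re

def tokenize(text):
    """Index-based scanning loop in the style of json.JSONDecoder:
    find the next token starting from the current position, process
    it, advance the position, repeat."""
    token = re.compile(r'\S+')
    pos, out = 0, []
    while True:
        m = token.search(text, pos)  # quickly find next token from pos
        if m is None:
            break
        out.append(m.group())        # process the found token
        pos = m.end()                # advance position and repeat
    return out
```

Note that `pos` here is a code point index into `text`, which is
exactly why these APIs bake indexing into the interface.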


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


Of course. But _existing_ Python interfaces all work with indices. And 
it is too late to change this; this train left 20 years ago.


There is no need for yet another way to do string operations. One 
obvious way is enough.





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

04.06.14 23:50, Glenn Linderman написав(ла):

3) (Most space efficient) One cached entry, that caches the last
codepoint/byte position referenced. UTF-8 is able to be traversed in
either direction, so next/previous codepoint access would be
relatively fast (and such are very common operations, even when indexing
notation is used: for ix in range( len( str_x )): func( str_x[ ix ]).)


Great idea! It should cover most real-world cases. Note that we can scan 
a UTF-8 string both left-to-right and right-to-left.





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stephen J. Turnbull
Paul Sokolovsky writes:

  Please put that in perspective when alarming over O(1) indexing of
  an inherently problematic niche datatype. (Again, it's not my or
  MicroPython's fault that it was forced as the standard string type.
  Maybe if CPython seriously considered the now-standard UTF-8
  encoding, the resulting str type might be different. But CPython has
  gigabytes of heap to spare, while for MicroPython, every half-bit is
  precious.)

Would you please stop trolling?  The reasons for adopting Unicode as a
separate data type were good and sufficient in 2000, and they remain
so today, even if you have been fortunate enough not to burn yourself
on character-byte conflation yet.

What matters to you is that str (unicode) is an opaque type -- there
is no specification of the internal representation in the language
reference, and in fact several different ones coexist happily across
existing Python implementations -- and you're free to use a UTF-8
implementation if that suits the applications you expect for
MicroPython.

PEP 393 exists, of course, and specifies the current internal
representation for CPython 3.  But I don't see anything in it that
suggests it's mandated for any other implementation.



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

05.06.14 05:25, Terry Reedy написав(ла):

I mentioned it as an alternative during the '393 discussion. I more than
half agree that the FSR is the better choice for CPython, which had no
particular attachment to UTF-16 in the way that I think Jython, for
instance, does.


Yes, I remember. I think that a hybrid FSR-UTF-16 (like the FSR, but 
with UTF-16 used instead of UCS-4) is the better choice for CPython. I 
suppose that with the growing popularity of emoticons and other icon 
characters over the next 5 or 10 years, even English text will often 
contain astral characters. And spending 4 bytes per character when a 
long text contains one astral character looks too wasteful.





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stephen J. Turnbull
Serhiy Storchaka writes:

  Yes, I remember. I think that a hybrid FSR-UTF-16 (like the FSR, but
  with UTF-16 used instead of UCS-4) is the better choice for CPython.
  I suppose that with the growing popularity of emoticons and other
  icon characters over the next 5 or 10 years, even English text will
  often contain astral characters. And spending 4 bytes per character
  when a long text contains one astral character looks too wasteful.

Why use something that complex if you don't have to?  For the use case
you have in mind, just map them into private space.  If you really
want to be aggressive, use surrogate space, too (anything that cares
what a scalar represents should be trapping on non-scalars, catch that
exception and look up the char -- dangerous, though, because such
exceptions are probably all over the place).
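The private-space mapping suggested here could be sketched like this
(purely illustrative; the class and names are made up). Each astral
character encountered is assigned a slot in the BMP Private Use Area,
so the string itself can stay in a fixed 16-bit representation:

```python
# Illustrative sketch: keep strings in a fixed-width 16-bit array by
# remapping rare astral (non-BMP) characters into the Private Use Area.
PUA_START = 0xE000

class AstralMap:
    def __init__(self):
        self.to_bmp = {}    # astral code point -> PUA code point
        self.from_bmp = {}  # PUA code point -> astral code point

    def encode_char(self, cp):
        """Map a code point into the BMP, allocating a PUA slot for
        astral characters on first sight."""
        if cp <= 0xFFFF:
            return cp
        if cp not in self.to_bmp:
            pua = PUA_START + len(self.to_bmp)
            self.to_bmp[cp] = pua
            self.from_bmp[pua] = cp
        return self.to_bmp[cp]

    def decode_char(self, cp):
        """Reverse the mapping; non-mapped code points pass through."""
        return self.from_bmp.get(cp, cp)
```

The obvious caveat, as noted above, is that any code inspecting the
stored values directly sees PUA code points rather than the real
characters, so the table lookup has to happen at every boundary.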





[Python-Dev] Request: new Asyncio component on the bug tracker

2014-06-05 Thread Victor Stinner
Hi,

Would it be possible to add a new Asyncio component on
bugs.python.org? If this component is selected, the default nosy list
for asyncio would be used (guido, yury and me, there is already such
list in the nosy list completion).

Full text search for asyncio returns too many results.

Victor


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Wed, 04 Jun 2014 22:15:30 -0400
Terry Reedy tjre...@udel.edu wrote:

 On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
 
  "Well" is subjective (or should be defined formally based on the
  requirements). With my MicroPython hat on, an implementation which
  receives a string, transcodes it (leading to a bigger size), just to
  immediately transcode it back and send it out, is an awful,
  environmentally unfriendly implementation ;-).
 
 I am not sure what you concretely mean by 'receive a string', but I 

I (surely) mean an abstract input (as an Input/Output aka I/O)
operation.

 think you are again batting at a strawman. If you mean 'read from a 
 file', and all you want to do is read bytes from and write bytes to 
 external 'files', then there is obviously no need to transcode and 
 neither Python 2 or 3 make you do so.

But most files and network protocols are text-based, and I (and many
other people) don't want to artificially use a binary data type for
them, with all the attached funny things, like the b prefix. And then
Python 2 indeed doesn't transcode anything, while Python 3 does, without
being asked, and for no good purpose, because in most cases input data
will be output as-is (maybe in byte-boundary-split chunks).

So, it all goes in circles - ignoring the forced-Unicode problem (after
a week of subscription to python-list, half of the traffic there appears
to be dedicated to Unicode-related flames) on python-dev's part is not
going to help (the Python community).

[]



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Thu, 05 Jun 2014 16:54:11 +0900
Stephen J. Turnbull step...@xemacs.org wrote:

 Paul Sokolovsky writes:
 
   Please put that in perspective when alarming over O(1) indexing of
   an inherently problematic niche datatype. (Again, it's not my or
   MicroPython's fault that it was forced as the standard string type.
   Maybe if CPython seriously considered the now-standard UTF-8
   encoding, the resulting str type might be different. But CPython
   has gigabytes of heap to spare, while for MicroPython, every
   half-bit is precious.)
 
 Would you please stop trolling?  The reasons for adopting Unicode as a
 separate data type were good and sufficient in 2000, and they remain

If it had been kept as a separate data type, there wouldn't be any
problem. But it was made the one and only string type, and all the
strife started then.

And there is going to be trolling as long as Python developers and
decision-makers ignore (troll?) the outcry from the community (again, I
was surprised and not surprised to see ~50% of the traffic on
python-list touching Unicode issues).

Well, I understand the plan - hoping that people will get over this.
And I'm personally happy to stay away from this trolling, but any
discussion related to Unicode goes in circles and returns to the
feeling that the central role Unicode was given by Python 3 is
misplaced.

Then for me, it's just a matter of job security and personal future - I
don't want to spend the rest of my days as a javascript (or other
idiotic language) monkey. And the message is clear in the air
(http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ and
elsewhere): if Python 2's strings are now in Go, and in Python itself
there are now Java strings, all causing strife, why not go cruising
around and see what's up, instead of staying with a strong, and growing
bigger, community.

 so today, even if you have been fortunate enough not to burn yourself
 on character-byte conflation yet.
 
 What matters to you is that str (unicode) is an opaque type -- there
 is no specification of the internal representation in the language
 reference, and in fact several different ones coexist happily across
 existing Python implementations -- and you're free to use a UTF-8
 implementation if that suits the applications you expect for
 MicroPython.
 
 PEP 393 exists, of course, and specifies the current internal
 representation for CPython 3.  But I don't see anything in it that
 suggests it's mandated for any other implementation.

I knew all this before very well. What's strange is that other
developers don't know, or don't treat seriously, all of the above.
That's why the gentleman who kindly was interested in adding Unicode
support to MicroPython started with the idea of dragging in the CPython
implementation. And the only effect of persuading him that it's not
necessarily the best solution was that he started to feel that he was
being manipulated into writing something ugly, instead of the bright
idea he had.

That's why another gentleman reduces it to: "O(1) on string indexing or
not a Python!".

And that's why another gentleman, who agrees with the UTF-8 arguments,
still gives an excuse
(https://mail.python.org/pipermail/python-dev/2014-June/134727.html):
"In this context, while a fixed-width encoding may be the correct
choice it would also likely be the wrong choice."


In this regard, I'm glad to participate in a mind-resetting discussion.
So, let's reiterate - there's nothing like "the best", "the only
right", "the only correct", "righter than", "more correct than" in
CPython's implementation of Unicode storage. It is *arbitrary*. Well,
sure, it's not arbitrary, but based on requirements, and these
requirements match CPython's (implied) usage model well enough. But
among all possible sets of requirements, CPython's requirements are no
more valid than other possible ones. And other sets of requirements
fairly clearly lead to a situation where the CPython implementation is
rejected as not correct for those requirements at all.



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 17:54, Stephen J. Turnbull step...@xemacs.org wrote:
 What matters to you is that str (unicode) is an opaque type -- there
 is no specification of the internal representation in the language
 reference, and in fact several different ones coexist happily across
 existing Python implementations -- and you're free to use a UTF-8
 implementation if that suits the applications you expect for
 MicroPython.

However, as others have noted in the thread, the critical thing is to
*not* let that internal implementation detail leak into the Python
level string behaviour. That's what happened with narrow builds of
Python 2 and pre-PEP-393 releases of Python 3 (effectively using
UTF-16 internally), and it was the cause of a sufficiently large
number of bugs that the Linux distributions tend to instead accept the
memory cost of using wide builds (4 bytes for all code points) for
affected versions.

Preserving the "Python 3 str type is an immutable array of code
points" semantics matters significantly more than whether or not
indexing by code point is O(1). The various caching tricks suggested
in this thread (especially leading ASCII characters, trailing ASCII
characters, and position & index of last lookup) could keep the
typical lookup performance well below O(N).

 PEP 393 exists, of course, and specifies the current internal
 representation for CPython 3.  But I don't see anything in it that
 suggests it's mandated for any other implementation.

CPython is constrained by C API compatibility requirements, as well as
implementation constraints due to the amount of internal code that
would need to be rewritten to handle a variable width encoding as the
canonical internal representation (since the problems with Python 2
narrow builds mean we already know variable width encodings aren't
handled correctly by the current code).

Implementations that share code with CPython, or try to mimic the C
API especially closely, may face similar restrictions. Outside that, I
think we're better off if alternative implementations are free to
experiment with different internal string representations.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 21:25, Paul Sokolovsky pmis...@gmail.com wrote:
 Well, I understand the plan - hoping that people will get over this.
 And I'm personally happy to stay away from this trolling, but any
 discussion related to Unicode goes in circles and returns to the
 feeling that the central role Unicode was given by Python 3 is
 misplaced.

Many of the challenges network programmers face in Python 3 are around
binary data being more inconvenient to work with than it needs to be,
not the fact we decentralised boundary code by offering a strict
binary/text separation as the default mode of operation. Aside from
some of the POSIX locale handling issues on Linux, many of the
concerns are with the usability of bytes and bytearray, not with str -
that's why binary interpolation is coming back in 3.5, and there will
likely be other usability tweaks for those types as well.

More on that at
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 21:43:16 +1000
Nick Coghlan ncogh...@gmail.com wrote:

 On 5 June 2014 21:25, Paul Sokolovsky pmis...@gmail.com wrote:
  Well, I understand the plan - hoping that people will get over
  this. And I'm personally happy to stay away from this trolling,
  but any discussion related to Unicode goes in circles and returns
  to the feeling that the central role Unicode was given by Python 3
  is misplaced.
 
 Many of the challenges network programmers face in Python 3 are around
 binary data being more inconvenient to work with than it needs to be,
 not the fact we decentralised boundary code by offering a strict
 binary/text separation as the default mode of operation. 

Just to clarify - (many) other gentlemen and I (in that order, I'm not
taking the lead) don't call for going back to Python 2 behavior with
implicit conversion between byte-oriented strings and Unicode, etc.
They just point out that perhaps Python 3 went too far with the Unicode
cause by making it the default string type. Strict separation is surely
mostly a good thing (I can sigh that it leads to Java-like dichotomic
bloat for all I/O classes, but well, I was able to put up with that in
MicroPython already).

 Aside from
 some of the POSIX locale handling issues on Linux, many of the
 concerns are with the usability of bytes and bytearray, not with str -
 that's why binary interpolation is coming back in 3.5, and there will
 likely be other usability tweaks for those types as well.

All these changes are what let me dream on and speculate on the
possibility that Python 4 could offer an encoding-neutral string type
(which means based on bytes), while moving unicode back to an explicit
type to be used only when needed (bloated frameworks like Django can
force users to it anyway, but that will be forcing at the framework
level, not the language level, against which people rebel). People can
dream, right?


Thanks,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stefan Krah
Paul Sokolovsky pmis...@gmail.com wrote:
 In this regard, I'm glad to participate in a mind-resetting
 discussion. So, let's reiterate - there's nothing like "the best",
 "the only right", "the only correct", "righter than", "more correct
 than" in CPython's implementation of Unicode storage. It is
 *arbitrary*. Well, sure, it's not arbitrary, but based on
 requirements, and these requirements match CPython's (implied) usage
 model well enough. But among all possible sets of requirements,
 CPython's requirements are no more valid than other possible ones.
 And other sets of requirements fairly clearly lead to a situation
 where the CPython implementation is rejected as not correct for those
 requirements at all.

Several core-devs have said that using UTF-8 for MicroPython is perfectly okay.
I also think it's the right choice and I hope that you guys come up with a very
efficient implementation.


Stefan Krah




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 22:01, Paul Sokolovsky pmis...@gmail.com wrote:
 Aside from
 some of the POSIX locale handling issues on Linux, many of the
 concerns are with the usability of bytes and bytearray, not with str -
 that's why binary interpolation is coming back in 3.5, and there will
 likely be other usability tweaks for those types as well.

  All these changes are what let me dream on and speculate on the
  possibility that Python 4 could offer an encoding-neutral string
  type (which means based on bytes), while moving unicode back to an
  explicit type to be used only when needed (bloated frameworks like
  Django can force users to it anyway, but that will be forcing at the
  framework level, not the language level, against which people
  rebel). People can dream, right?

If you don't model strings as arrays of code points, or at least
assume a particular universal encoding (like UTF-8), you have to give
up string concatenation in order to tolerate arbitrary encodings -
otherwise you end up with unintelligible data that nobody can decode
because it switches encodings without notice. That's a viable model if
your OS guarantees it (Mac OS X does, for example, so Python 3 assumes
UTF-8 for all OS interfaces there), but Linux currently has no such
guarantee - many runtimes just decide they don't care, and assume
UTF-8 anyway (Python 3 may even join them some day, due to the
problems caused by trusting the locale encoding to be correct, but the
startup code will need non-trivial changes for that to happen - the
C.UTF-8 locale may even become widespread before we get there).
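The platform-dependent encoding choices Nick describes are visible
from Python itself (a minimal illustration; the actual values vary by
OS and environment, which is precisely the point):

```python
import codecs
import locale
import sys

# CPython derives its OS-boundary encodings from the platform and
# environment: UTF-8 is assumed on Mac OS X, while on Linux the
# answer comes from the (possibly wrong) locale settings.
fs_enc = sys.getfilesystemencoding()
locale_enc = locale.getpreferredencoding(False)

# Whatever they are, both names must resolve to real codecs.
codecs.lookup(fs_enc)
codecs.lookup(locale_enc)
print(fs_enc, locale_enc)
```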

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Tim Delaney
On 5 June 2014 22:01, Paul Sokolovsky pmis...@gmail.com wrote:


 All these changes are what let me dream on and speculate on
 possibility that Python4 could offer an encoding-neutral string type
 (which means based on bytes)


To me, an encoding-neutral string type means roughly "characters are
atomic", and the best representation we have for a character is a
Unicode code point. Through any interface that provides characters,
each individual character (code point) is indivisible.

To me, Python 3 has exactly that encoding-neutral string type. It also
has a bytes type that is just that - bytes, which can represent
anything at all. It might be the UTF-8 representation of a string, but
you have the freedom to manipulate it however you like - including
making it no longer valid UTF-8.
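A minimal illustration of that str/bytes split:

```python
s = "snowman: ☃"
b = s.encode("utf-8")   # the UTF-8 representation, as bytes
assert len(s) == 10     # str counts indivisible code points
assert len(b) == 12     # bytes counts octets: ☃ (U+2603) takes three
broken = b[:-1]         # bytes can be freely truncated/manipulated...
try:
    broken.decode("utf-8")
except UnicodeDecodeError:
    pass                # ...making them no longer valid UTF-8
```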

Whilst I think O(1) indexing of strings is important, I don't think it's as
important as the property that characters are indivisible and would be
quite happy for MicroPython to use UTF-8 as the underlying string
representation (or some more clever thing, several ideas in this thread) so
long as:

1. It maintains a string type that presents code points as indivisible
elements;

2. The performance consequences of using UTF-8 are documented, as well as
any optimisations, tricks, etc that are used to overcome those consequences
(and what impact if any they would have if code written for MicroPython was
run in CPython).

Cheers,

Tim Delaney


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 22:20:04 +1000
Nick Coghlan ncogh...@gmail.com wrote:

[]
 problems caused by trusting the locale encoding to be correct, but the
 startup code will need non-trivial changes for that to happen - the
 C.UTF-8 locale may even become widespread before we get there).

... And until those golden times come, it would be nice if Python did
not force its perfect-world model, which unfortunately is not based on
surrounding reality, and let users solve their encoding problems
themselves - when they need to, because again, one can go quite a long
way without dealing with encodings at all. Whereas now Python 3 forces
users to deal with encodings almost universally, by forcing a
particular one for all strings (which, again, doesn't correspond to the
state of surrounding reality). I already hear the response that it's
good that users are taught to deal with encodings, that it will make
them write correct programs, but that's a bit far away from the
original aim of making writing correct programs easy and pleasant.
(And definitions of "correct" vary.)

But all that is just an opinion.

 
 Cheers,
 Nick.
 
 -- 
 Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 22:10, Stefan Krah ste...@bytereef.org wrote:
 Paul Sokolovsky pmis...@gmail.com wrote:
 In this regard, I'm glad to participate in a mind-resetting
 discussion. So, let's reiterate - there's nothing like "the best",
 "the only right", "the only correct", "righter than", "more correct
 than" in CPython's implementation of Unicode storage. It is
 *arbitrary*. Well, sure, it's not arbitrary, but based on
 requirements, and these requirements match CPython's (implied) usage
 model well enough. But among all possible sets of requirements,
 CPython's requirements are no more valid than other possible ones.
 And other sets of requirements fairly clearly lead to a situation
 where the CPython implementation is rejected as not correct for those
 requirements at all.

 Several core-devs have said that using UTF-8 for MicroPython is perfectly 
 okay.
 I also think it's the right choice and I hope that you guys come up with a 
 very
 efficient implementation.

Based on this discussion, I've also posted a draft patch aimed at
clarifying the relevant aspects of the data model section of the
language reference (http://bugs.python.org/issue21667).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 22:37, Paul Sokolovsky pmis...@gmail.com wrote:
 On Thu, 5 Jun 2014 22:20:04 +1000
 Nick Coghlan ncogh...@gmail.com wrote:
 problems caused by trusting the locale encoding to be correct, but the
 startup code will need non-trivial changes for that to happen - the
 C.UTF-8 locale may even become widespread before we get there).

  ... And until those golden times come, it would be nice if Python
  did not force its perfect-world model, which unfortunately is not
  based on surrounding reality, and let users solve their encoding
  problems themselves - when they need to, because again, one can go
  quite a long way without dealing with encodings at all. Whereas now
  Python 3 forces users to deal with encodings almost universally, by
  forcing a particular one for all strings (which, again, doesn't
  correspond to the state of surrounding reality). I already hear the
  response that it's good that users are taught to deal with
  encodings, that it will make them write correct programs, but that's
  a bit far away from the original aim of making writing correct
  programs easy and pleasant. (And definitions of "correct" vary.)

As I've said before in other contexts, find me Windows, Mac OS X and
JVM developers, or educators and scientists that are as concerned by
the text model changes as folks that are primarily focused on Linux
system (including network) programming, and I'll be more willing to
concede the point.

Windows, Mac OS X, and the JVM are all opinionated about the text
encodings to be used at platform boundaries (using UTF-16, UTF-8 and
UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX)
says "well, it's configurable, but we won't provide a reliable
mechanism for finding out what the encoding is". So either guess as
best you can based on the info the OS *does* provide, assume UTF-8,
assume 'some ASCII compatible encoding', or don't do anything that
requires knowing the encoding of the data being exchanged with the OS -
like, say, displaying file names to users, or accepting arbitrary text
as input, transforming it in a content-aware fashion, and echoing it
back in a console application.

None of those options are perfectly good choices. 6(ish) years ago, we
chose the first option, because it has the best chance of working
properly on Linux systems that use ASCII incompatible encodings like
ShiftJIS, ISO-2022, and various other East Asian codecs. For normal
user space programming, Linux is pretty reliable when it comes to
ensuring the locale encoding is set to something sensible, but the
price we currently pay for that decision is interoperability issues
with things like daemons not receiving any configuration settings and
hence falling back to the POSIX locale, and ssh environment forwarding
moving a client's encoding settings to a session on a server with
different settings. I still consider it preferable to impose
inconveniences like that based on use case (situations where Linux
systems don't provide sensible encoding settings) than geographic
region (locales where ASCII incompatible encodings are likely to still
be in common use).

If I (or someone else) ever find the time to implement PEP 432 (or
something like it) to address some of the limitations of the
interpreter startup sequence that currently make it difficult to avoid
relying on the POSIX locale encoding on Linux, then we'll be in a
position to reassess that decision based on the increased adoption of
UTF-8 by Linux distributions in recent years. As the major community
Linux distributions complete the migration of their system utilities
to Python 3, we'll get to see if they decide it's better to make their
locale settings more reliable, or help make it easier for Python 3 to
ignore them when they're wrong.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Steven D'Aprano
On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
 There is a discussion over at MicroPython about the internal 
 representation of Unicode strings. Micropython is aimed at embedded 
 devices, and so minimizing memory use is important, possibly even 
 more important than performance.
[...]

Wow! I'm amazed at the response here, since I expected it would have 
received a fairly brief "Yes" or "No" response, not this long thread. 
Here is a summary (as best as I am able) of a few points which I think 
are important:

(1) I asked if it would be okay for MicroPython to *optionally* use 
nominally Unicode strings limited to ASCII. Pretty much the only 
response to this has been Guido saying "That would be a pretty lousy 
option", and since nobody has really defended the suggestion, I think we 
can assume that it's off the table.

(2) I asked if it would be okay for µPy to use a UTF-8 implementation 
even though it would lead to O(N) indexing operations instead of O(1). 
There's been some opposition to this, including Guido's:

Then again the UTF-8 option would be pretty devastating 
too for anything manipulating strings (especially since 
many Python APIs are defined using indexes, e.g. the re 
module).

but unless Guido wants to say different, I think the consensus is that 
a UTF-8 implementation is allowed, even at the cost of O(N) indexing 
operations. Saving memory at the cost of time -- assuming that it does 
save memory, which I think is an assumption and not proven -- is allowed.
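To make the trade-off concrete, here is a minimal sketch (not µPy's actual code) of what O(N) indexing over UTF-8 looks like: a linear scan over lead bytes to find the byte offset of the n-th code point.

```python
def utf8_offset(data: bytes, index: int) -> int:
    """Byte offset of the index-th code point in well-formed UTF-8."""
    offset = 0
    for _ in range(index):
        if offset >= len(data):
            raise IndexError("code point index out of range")
        lead = data[offset]
        if lead < 0x80:        # 1-byte sequence (ASCII)
            offset += 1
        elif lead < 0xE0:      # 2-byte sequence
            offset += 2
        elif lead < 0xF0:      # 3-byte sequence
            offset += 3
        else:                  # 4-byte sequence
            offset += 4
    return offset

data = "aβγd".encode("utf-8")   # 1 + 2 + 2 + 1 bytes
print(utf8_offset(data, 2))     # 3: skips 'a' (1 byte) and 'β' (2 bytes)
```

This is where the caching schemes discussed elsewhere in the thread come in: they try to avoid restarting this scan from offset 0 on every index operation.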

(3) It seems to me that there's been a lot of theorizing about what 
implementation will be obviously more efficient. Folks, how about some 
benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions more suited in my 
opinion to python-ideas, or even python-list, for ways to implement O(1) 
indexing on top of UTF-8. Some of them involve per-string mutable state 
(e.g. the last index seen), or complicated int sub-classes that need to 
know what string they come from. Remember your Zen please:

Simple is better than complex.
Complex is better than complicated.
...
If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more 
efficient, but look forward to seeing the result of benchmarks. The 
rationale of internal UTF-8 is that the use of any other encoding 
internally will be inefficient since those strings will need to be 
transcoded to UTF-8 before they can be written or printed, so keeping 
them as UTF-8 in the first place saves the transcoding step. Well, yes, 
but many strings may never be written out:

print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if 
the internal encoding of strings is more efficient than UTF-8, and most 
of them never need transcoding to UTF-8, a non-UTF-8 internal format 
might be a nett win. So I'm looking forward to seeing the results of 
µPy's experiments with it.
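For what it's worth, the memory question can be probed directly on CPython, whose PEP 393 representation stores 1, 2 or 4 bytes per code point depending on the widest character present (exact getsizeof figures vary by version and platform):

```python
import sys

ascii_s = "a" * 100
cjk_s = "\u4e00" * 100  # CJK: 2 bytes/char internally, 3 bytes/char in UTF-8

# Internal storage: the CJK string needs roughly twice the payload
print(sys.getsizeof(ascii_s), sys.getsizeof(cjk_s))

# UTF-8 sizes of the same strings
print(len(ascii_s.encode("utf-8")), len(cjk_s.encode("utf-8")))  # 100 300
```

So for BMP text beyond Latin-1, UTF-8 can be smaller than CPython's internal form (3N vs 2N only for astral-heavy text does it flip the other way), which is exactly why benchmarks rather than intuition are needed.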

Thanks to all who have commented.



-- 
Steven



Re: [Python-Dev] Request: new Asyncio component on the bug tracker

2014-06-05 Thread R. David Murray
On Thu, 05 Jun 2014 12:03:15 +0200, Victor Stinner victor.stin...@gmail.com 
wrote:
 Would it be possible to add a new Asyncio component on
 bugs.python.org? If this component is selected, the default nosy list
 for asyncio would be used (guido, yury and me, there is already such
 list in the nosy list completion).

Done.  There are two other people in the nosy list (Giampaolo and
Antoine).  If either of those wish to be auto-nosy, let me know.

--David


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Moore
On 5 June 2014 14:15, Nick Coghlan ncogh...@gmail.com wrote:
 As I've said before in other contexts, find me Windows, Mac OS X and
 JVM developers, or educators and scientists that are as concerned by
 the text model changes as folks that are primarily focused on Linux
 system (including network) programming, and I'll be more willing to
 concede the point.

There is once again a strong selection bias in this discussion, by its
very nature. People who like the new model don't have anything to
complain about, and so are not heard.

Just to support Nick's point, I for one find the Python 3 text model a
huge benefit, both in practical terms of making my programs more
robust, and educationally, as I have a far better understanding of
encodings and their issues than I ever did under Python 2. Whenever a
discussion like this occurs, I find it hard not to resent the people
arguing that the new model should be taken away from me and replaced
with a form of the old error-prone (for me) approach - as if it was in
my best interests.

Internal details don't bother me - using UTF8 and having indexing be
potentially O(N) is of little relevance. But make me work with a
string type that *doesn't* abstract a string as a sequence of Unicode
code points and I'll get very upset.

Paul


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Daniel Holth
On Thu, Jun 5, 2014 at 11:59 AM, Paul Moore p.f.mo...@gmail.com wrote:
 On 5 June 2014 14:15, Nick Coghlan ncogh...@gmail.com wrote:
 As I've said before in other contexts, find me Windows, Mac OS X and
 JVM developers, or educators and scientists that are as concerned by
 the text model changes as folks that are primarily focused on Linux
 system (including network) programming, and I'll be more willing to
 concede the point.

 There is once again a strong selection bias in this discussion, by its
 very nature. People who like the new model don't have anything to
 complain about, and so are not heard.

 Just to support Nick's point, I for one find the Python 3 text model a
 huge benefit, both in practical terms of making my programs more
 robust, and educationally, as I have a far better understanding of
 encodings and their issues than I ever did under Python 2. Whenever a
 discussion like this occurs, I find it hard not to resent the people
 arguing that the new model should be taken away from me and replaced
 with a form of the old error-prone (for me) approach - as if it was in
 my best interests.

 Internal details don't bother me - using UTF8 and having indexing be
 potentially O(N) is of little relevance. But make me work with a
 string type that *doesn't* abstract a string as a sequence of Unicode
 code points and I'll get very upset.

Once you get past whether str + bytes throws an exception which seems
to be the divide most people focus on, you can discover new things
like dance-encoded strings, bytes decoded using an incorrect encoding
intended to be transcoded into the correct encoding later, surrogates
that work perfectly until .encode(), str(bytes), APIs that disagree
with you about whether the result should be str or bytes, APIs that
return either string or bytes depending on their initializers and so
on. Unicode can still be complicated in Python 3 independent of any
judgement about whether it is worse, better, or different than Python
2.
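One concrete instance of the "surrogates that work perfectly until .encode()" point is the surrogateescape error handler (PEP 383), which smuggles undecodable bytes through str and only fails on a plain encode:

```python
raw = b"caf\xe9"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# The undecodable byte becomes the lone surrogate U+DCE9
s = raw.decode("utf-8", errors="surrogateescape")
assert "\udce9" in s

# It round-trips back to the original bytes...
assert s.encode("utf-8", errors="surrogateescape") == raw

# ...but a plain encode blows up, possibly far from the decode site
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    print("fails late:", exc.reason)
```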


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Glenn Linderman

On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:

Hello,

On Wed, 04 Jun 2014 22:15:30 -0400
Terry Reedy tjre...@udel.edu wrote:


I think you are again batting at a strawman. If you mean 'read from a
file', and all you want to do is read bytes from and write bytes to
external 'files', then there is obviously no need to transcode and
neither Python 2 or 3 make you do so.

But most files, network protocols are text-based, and I (and many other
people) don't want to artificially use binary data type for them,
with all attached funny things, like b prefix. And then Python2
indeed doesn't transcode anything, and Python3 does, without being
asked, and for no good purpose, because in most cases, Input data will
be Output as-is (maybe in byte-boundary-split chunks).

So, it all goes in rounds - ignoring the forced-Unicode problem (after a
week of subscription to python-list, half of traffic there appear to be
dedicated to Unicode-related flames) on python-dev behalf is not
going to help (Python community).


If all your program is doing is reading and writing data (input data 
will be output as-is), then use of binary doesn't require the b prefix, 
because you aren't manipulating the data. Then you have no unnecessary 
transcoding.


If you actually wish to examine or manipulate the content as it flows 
by, then there are choices.


1) If you need to examine/manipulate only a small fraction of text data 
with the file, you can pay the small price of a few b prefixes to get 
high performance, and explicitly transcode only the portions that need 
to be manipulated.


2) If you are examining the bulk of the data as it flows by, but not 
manipulating it, just examining/extracting, then a full transcoding may 
be useful for that purpose... but you can perhaps do it explicitly, so 
that you keep the binary form for I/O. Careful of the block boundaries, 
in this case, however.


3) If you are actually manipulating the bulk of the data, then the 
double transcoding (once on input, and once on output) allows you to 
work in units of codepoints, rather than bytes, which generally makes 
the manipulation algorithms easier.


4) If you truly cannot afford the processor cost of the double 
transcoding, and need to do all your manipulations at the byte level, 
then you could avoid the need for b prefix by use of a preprocessor 
for those sections of code that are doing all and only bytes 
processing... and you'll have lots of arcane, error-prone code to write 
to manipulate the bytes rather than the codepoints.
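As a sketch of option 1 above (the record format here is purely illustrative, not from the thread): keep the record as bytes and transcode only the small text field actually being examined.

```python
# A binary record: 2-byte header, ASCII status text, NUL, opaque payload
record = b"\x00\x01INFO: disk ok\x00\x89PNG-ish payload..."

start = record.index(b"INFO")
end = record.index(b"\x00", start)

# Only these few bytes are ever transcoded; the payload stays binary
status = record[start:end].decode("ascii")
print(status)  # INFO: disk ok
```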


On the other hand, if you can convince your data sources and sinks to 
deal in UTF-8, and implement a UTF-8 str in μPy, then you can both avoid 
transcoding, and make the arcane algorithms part of the implementation 
of μPy rather than of the application code, and support full Unicode. 
And it seems to me that the world is moving that way... towards UTF-8 as 
the standard interchange format. Encourage it.


Glenn


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Glenn Linderman

On 6/5/2014 11:41 AM, Daniel Holth wrote:

discover new things
like dance-encoded strings, bytes decoded using an incorrect encoding
intended to be transcoded into the correct encoding later, surrogates
that work perfectly until .encode(), str(bytes), APIs that disagree
with you about whether the result should be str or bytes, APIs that
return either string or bytes depending on their initializers and so
on. Unicode can still be complicated in Python 3 independent of any
judgement about whether it is worse, better, or different than Python
2.

Yes, people can find ways to write bad code in any language.


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Antoine Pitrou

On 04/06/2014 02:51, Chris Angelico wrote:

On Wed, Jun 4, 2014 at 3:17 PM, Nick Coghlan ncogh...@gmail.com wrote:

It would. The downsides of a UTF-8 representation would be slower
iteration and much slower (O(N)) indexing/slicing.


There's no reason for iteration to be slower. Slicing would get
O(slice offset + slice size) instead of O(slice size).

Regards

Antoine.




[Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
Hi all,

There's a very valuable optimization -- temporary elision -- which
numpy can *almost* do. It gives something like a 10-30% speedup for
lots of common real-world expressions. It would probably would be
useful for non-numpy code too. (In fact it generalizes the str += str
special case that's currently hardcoded in ceval.c.) But it can't be
done safely without help from the interpreter, and possibly not even
then. So I thought I'd raise it here and see if we can get any
consensus on whether and how CPython could support this.

=== The dream ===

Here's the idea. Take an innocuous expression like:

   result = (a + b + c) / c

This gets evaluated as:

   tmp1 = a + b
   tmp2 = tmp1 + c
   result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are
arrays with N bytes each, and N is large. For simple arithmetic like
this, then costs are dominated by memory access. Allocating an N byte
array requires the kernel to clear the memory, which incurs N bytes of
memory traffic. If all the operands are already allocated, then
performing a three-operand operation like tmp1 = a + b involves 3N
bytes of memory traffic (reading the two inputs plus writing the
output). In total our example does 3 allocations and has 9 operands,
so it does 12N bytes of memory access.

If our arrays are small, then the kernel doesn't get involved and some
of these accesses will hit the cache, but OTOH the overhead of things
like malloc won't be amortized out; the best case starting from a cold
cache is 3 mallocs and 6N bytes worth of cache misses (or maybe 5N if
we get lucky and malloc'ing 'result' returns the same memory that tmp1
used, and it's still in cache).

There's an obvious missed optimization in this code, though, which is
that it keeps allocating new temporaries and throwing away old ones.
It would be better to just allocate a temporary once and re-use it:

   tmp1 = a + b
   tmp1 += c
   tmp1 /= c
   result = tmp1

Now we have only 1 allocation and 7 operands, so we touch only 8N
bytes of memory. For large arrays -- that don't fit into cache, and
for which per-op overhead is amortized out -- this gives a theoretical
33% speedup, and we can realistically get pretty close to this. For
smaller arrays, the re-use of tmp1 means that in the best case we have
only 1 malloc and 4N bytes worth of cache misses, and we also have a
smaller cache footprint, which means this best case will be achieved
more often in practice. For small arrays it's harder to estimate the
total speedup here, but 66% fewer mallocs and 33% fewer cache misses
is certainly enough to make a practical difference.

Such optimizations are important enough that numpy operations always
give the option of explicitly specifying the output array (like
in-place operators but more general and with clumsier syntax). Here's
an example small-array benchmark that IIUC uses Jacobi iteration to
solve Laplace's equation. It's been written in both natural and
hand-optimized formats (compare num_update to num_inplace):

   https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

num_inplace is totally unreadable, but because we've manually elided
temporaries, it's 10-15% faster than num_update. With our prototype
automatic temporary elision turned on, this difference disappears --
the natural code gets 10-15% faster, *and* we remove the temptation to
write horrible things like num_inplace.
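For readers following along, the transformation being measured is just this (toy operands; requires numpy):

```python
import numpy as np

a = np.ones(1000)
b = np.ones(1000)
c = np.full(1000, 2.0)

# Natural spelling: allocates three temporary arrays
result = (a + b + c) / c

# Hand-elided spelling: one temporary, reused in place
tmp = a + b
tmp += c
tmp /= c

assert np.allclose(tmp, result)
print(tmp[0])  # 2.0
```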

What do I mean by automatic temporary elision? It's *almost*
possible for numpy to automatically convert the first example into the
second. The idea is: we want to replace

  tmp2 = tmp1 + c

with

  tmp1 += c
  tmp2 = tmp1

And we can do this by defining

    def __add__(self, other):
        if is_about_to_be_thrown_away(self):
            return self.__iadd__(other)
        else:
            ...

now tmp1.__add__(c) does an in-place add and returns tmp1, no
allocation occurs, woohoo.

The only little problem is implementing is_about_to_be_thrown_away().

=== The sneaky-but-flawed approach ===

The following implementation may make you cringe, but it comes
tantalizingly close to working:

bool is_about_to_be_thrown_away(PyObject *obj) {
    return Py_REFCNT(obj) == 1;
}

In fact, AFAICT it's 100% correct for libraries being called by
regular python code (which is why I'm able to quote benchmarks at you
:-)). The bytecode eval loop always holds a reference to all operands,
and then immediately DECREFs them after the operation completes. If
one of our arguments has no other references besides this one, then we
can be sure that it is a dead obj walking, and steal its corpse.
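The same heuristic can be sketched in pure Python with sys.getrefcount. This is a toy illustration, CPython-specific, and the exact count threshold is fragile across interpreter versions -- it is emphatically not a safe implementation:

```python
import sys

class Arr:
    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # sys.getrefcount(self) counts: its own argument, the 'self'
        # binding in this frame, and the eval-loop stack slot holding
        # the operand.  A bare temporary therefore shows ~3 references;
        # anything also bound to a name shows more.
        if sys.getrefcount(self) <= 3:
            # (Probably) a dead temporary: mutate it and hand it back
            for i, v in enumerate(other.data):
                self.data[i] += v
            return self
        # A live object someone can still see: allocate a new result
        return Arr(x + y for x, y in zip(self.data, other.data))

a = Arr([1, 2, 3])
b = Arr([10, 20, 30])
c = Arr([100, 200, 300])
# (a + b) produces a temporary that the second + may reuse in place;
# either code path yields the same values.
r = (a + b) + c
print(r.data)  # [111, 222, 333]
```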

But this has a fatal flaw: people are unreasonable creatures, and
sometimes they call Python libraries without going through ceval.c
:-(. It's legal for random C code to hold an array object with a
single reference count, and then call PyNumber_Add on it, and then
expect the original array object to still be valid. But who writes
code like that in practice? Well, Cython does. So, this is no-go.

Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Paul Moore
On 5 June 2014 21:51, Nathaniel Smith n...@pobox.com wrote:
 Is there a better idea I'm missing?

Just a thought, but the temporaries come from the stack manipulation
done by the likes of the BINARY_ADD opcode. (After all the bytecode
doesn't use temporaries, it's a stack machine). Maybe BINARY_ADD and
friends could allow for an alternative fast calling convention for
__add__ implementations that uses the stack slots directly? This may be
something that's only plausible from C code, though. Or may not be
plausible at all. I haven't looked at ceval.c for many years...

If this is an insane idea, please feel free to ignore me :-)

Paul


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On Thu, Jun 5, 2014 at 10:37 PM, Paul Moore p.f.mo...@gmail.com wrote:
 On 5 June 2014 21:51, Nathaniel Smith n...@pobox.com wrote:
 Is there a better idea I'm missing?

 Just a thought, but the temporaries come from the stack manipulation
 done by the likes of the BINARY_ADD opcode. (After all the bytecode
 doesn't use temporaries, it's a stack machine). Maybe BINARY_ADD and
 friends could allow for an alternative fast calling convention for
 __add__ implementations that uses the stack slots directly? This may be
 something that's only plausible from C code, though. Or may not be
 plausible at all. I haven't looked at ceval.c for many years...

 If this is an insane idea, please feel free to ignore me :-)

To make sure I understand correctly, you're suggesting something like
adding a new set of special method slots, __te_add__, __te_mul__,
etc., which BINARY_ADD and friends would check for and if found,
dispatch to without going through PyNumber_Add? And this way, a type
like numpy's array could have a special implementation for __te_add__
that works the same as __add__, except with the added wrinkle that it
knows that it will only be called by the interpreter and thus any
arguments with refcnt 1 must be temporaries?

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Paul Moore
On 5 June 2014 22:47, Nathaniel Smith n...@pobox.com wrote:
 To make sure I understand correctly, you're suggesting something like
 adding a new set of special method slots, __te_add__, __te_mul__,
 etc.

I wasn't thinking in that much detail, TBH. I'm not sure adding a
whole set of new slots is sensible for such a specialised case. I
think I was more assuming that the special method implementations
could use an alternative calling convention, METH_STACK in place of
METH_VARARGS, for example. That would likely only be viable for types
implemented in C.

But either way, it may be more complicated than the advantages would justify...
Paul


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Terry Reedy

On 6/5/2014 4:51 PM, Nathaniel Smith wrote:


In fact, AFAICT it's 100% correct for libraries being called by
regular python code (which is why I'm able to quote benchmarks at you
:-)). The bytecode eval loop always holds a reference to all operands,
and then immediately DECREFs them after the operation completes. If
one of our arguments has no other references besides this one, then we
can be sure that it is a dead obj walking, and steal its corpse.

But this has a fatal flaw: people are unreasonable creatures, and
sometimes they call Python libraries without going through ceval.c
:-(. It's legal for random C code to hold an array object with a
single reference count, and then call PyNumber_Add on it, and then
expect the original array object to still be valid. But who writes
code like that in practice? Well, Cython does. So, this is no-go.


I understand that a lot of numpy/scipy code is compiled with Cython, so 
you really want the optimization to continue working when so compiled. 
Is there a simple change to Cython that would work, perhaps in 
coordination with a change to numpy? Is so, you could get the result 
before 3.5 comes out.


I realized that there are other compilers than Cython and non-numpy code 
that could benefit, so that a more generic solution would also be good. 
In particular


 Here's the idea. Take an innocuous expression like:

 result = (a + b + c) / c

 This gets evaluated as:

 tmp1 = a + b
 tmp2 = tmp1 + c
 result = tmp2 / c
...
 There's an obvious missed optimization in this code, though, which is
 that it keeps allocating new temporaries and throwing away old ones.
 It would be better to just allocate a temporary once and re-use it:
 tmp1 = a + b
 tmp1 += c
 tmp1 /= c
 result = tmp1

Could this transformation be done in the ast? And would that help?

A prolonged discussion might be better on python-ideas. See what others say.

--
Terry Jan Reedy



Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On Thu, Jun 5, 2014 at 11:12 PM, Paul Moore p.f.mo...@gmail.com wrote:
 On 5 June 2014 22:47, Nathaniel Smith n...@pobox.com wrote:
 To make sure I understand correctly, you're suggesting something like
 adding a new set of special method slots, __te_add__, __te_mul__,
 etc.

 I wasn't thinking in that much detail, TBH. I'm not sure adding a
 whole set of new slots is sensible for such a specialised case. I
 think I was more assuming that the special method implementations
 could use an alternative calling convention, METH_STACK in place of
 METH_VARARGS, for example. That would likely only be viable for types
 implemented in C.

 But either way, it may be more complicated than the advantages would 
 justify...

Oh, I see, that's clever. But, unfortunately most __special__ methods
at the C level don't use METH_*, they just have hard-coded calling
conventions:
  https://docs.python.org/3/c-api/typeobj.html#number-structs

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 6 Jun 2014 05:13, Glenn Linderman v+pyt...@g.nevcal.com wrote:

 On 6/5/2014 11:41 AM, Daniel Holth wrote:

 discover new things
 like dance-encoded strings, bytes decoded using an incorrect encoding
 intended to be transcoded into the correct encoding later, surrogates
 that work perfectly until .encode(), str(bytes), APIs that disagree
 with you about whether the result should be str or bytes, APIs that
 return either string or bytes depending on their initializers and so
 on. Unicode can still be complicated in Python 3 independent of any
 judgement about whether it is worse, better, or different than Python
 2.

 Yes, people can find ways to write bad code in any language.

Note that several of the issues Daniel mentions here are due to the lack of
reliable encoding settings on Linux and the challenges of the Py2-3
migration, rather than users writing bad code. Several of them represent
bugs to be fixed or serve as indicators of missing features that would make
it easier to work around an imperfect world.

Cheers,
Nick.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Greg Ewing

Steven D'Aprano wrote:
(1) I asked if it would be okay for MicroPython to *optionally* use 
nominally Unicode strings limited to ASCII. Pretty much the only 
response to this has been Guido saying "That would be a pretty lousy 
option",


It would be limiting to have this as the *only* way of
dealing with unicode, but I don't see anything wrong with
having this available as an option for applications that
truly don't need anything more than ascii. There must be
plenty of those; the controller that runs my car engine,
for example, doesn't exchange text with the outside world
at all.

The 
rationale of internal UTF-8 is that the use of any other encoding 
internally will be inefficient since those strings will need to be 
transcoded to UTF-8 before they can be written or printed,


No, I think the rationale is that UTF-8 is likely to use
less memory than UTF-16 or UTF-32.

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nikolaus Rath
Nathaniel Smith n...@pobox.com writes:
 Such optimizations are important enough that numpy operations always
 give the option of explicitly specifying the output array (like
 in-place operators but more general and with clumsier syntax). Here's
 an example small-array benchmark that IIUC uses Jacobi iteration to
 solve Laplace's equation. It's been written in both natural and
 hand-optimized formats (compare num_update to num_inplace):

https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace

 num_inplace is totally unreadable, but because we've manually elided
 temporaries, it's 10-15% faster than num_update. 

Does it really have to be that ugly? Shouldn't using

  tmp += u[2:,1:-1]
  tmp *= dy2
  
instead of

  np.add(tmp, u[2:,1:-1], out=tmp)
  np.multiply(tmp, dy2, out=tmp)

give the same performance? (yes, not as nice as what you're proposing,
but I'm still curious).


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On 6 Jun 2014 02:16, Nikolaus Rath nikol...@rath.org wrote:

 Nathaniel Smith n...@pobox.com writes:
  Such optimizations are important enough that numpy operations always
  give the option of explicitly specifying the output array (like
  in-place operators but more general and with clumsier syntax). Here's
  an example small-array benchmark that IIUC uses Jacobi iteration to
  solve Laplace's equation. It's been written in both natural and
  hand-optimized formats (compare num_update to num_inplace):
 
 
https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace
 
  num_inplace is totally unreadable, but because we've manually elided
  temporaries, it's 10-15% faster than num_update.

 Does it really have to be that ugly? Shouldn't using

   tmp += u[2:,1:-1]
   tmp *= dy2

 instead of

   np.add(tmp, u[2:,1:-1], out=tmp)
   np.multiply(tmp, dy2, out=tmp)

 give the same performance? (yes, not as nice as what you're proposing,
 but I'm still curious).

Yes, only the last line actually requires the out= syntax, everything else
could use in place operators instead (and automatic temporary elision
wouldn't work for the last line anyway). I guess whoever wrote it did it
that way for consistency (and perhaps in hopes of eking out a tiny bit more
speed - in numpy currently the in-place operators are implemented by
dispatching to function calls like those).

Not sure how much difference it really makes in practice though. It'd still
be 8 statements and two named temporaries to do the work of one infix
expression, with order of operations implicit.

-n


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nathaniel Smith
On 5 Jun 2014 23:58, Terry Reedy tjre...@udel.edu wrote:

 On 6/5/2014 4:51 PM, Nathaniel Smith wrote:

 In fact, AFAICT it's 100% correct for libraries being called by
 regular python code (which is why I'm able to quote benchmarks at you
 :-)). The bytecode eval loop always holds a reference to all operands,
 and then immediately DECREFs them after the operation completes. If
 one of our arguments has no other references besides this one, then we
 can be sure that it is a dead obj walking, and steal its corpse.

 But this has a fatal flaw: people are unreasonable creatures, and
 sometimes they call Python libraries without going through ceval.c
 :-(. It's legal for random C code to hold an array object with a
 single reference count, and then call PyNumber_Add on it, and then
 expect the original array object to still be valid. But who writes
 code like that in practice? Well, Cython does. So, this is no-go.


 I understand that a lot of numpy/scipy code is compiled with Cython, so
you really want the optimization to continue working when so compiled. Is
there a simple change to Cython that would work, perhaps in coordination
with a change to numpy? If so, you could get the result before 3.5 comes
out.

Unfortunately we don't actually know whether Cython is the only culprit
(such code *could* be written by hand), and even if we fixed Cython it
would take some unknowable amount of time before all downstream users
upgraded their Cythons. (It's pretty common for projects to check in
Cython-generated .c files, and only regenerate when the Cython source
actually gets modified.) Pretty risky for an optimization.

 I realized that there are other compilers than Cython and non-numpy code
that could benefit, so that a more generic solution would also be good. In
particular

  Here's the idea. Take an innocuous expression like:
 
  result = (a + b + c) / c
 
  This gets evaluated as:
 
  tmp1 = a + b
  tmp2 = tmp1 + c
  result = tmp2 / c
 ...

  There's an obvious missed optimization in this code, though, which is
  that it keeps allocating new temporaries and throwing away old ones.
  It would be better to just allocate a temporary once and re-use it:
  tmp1 = a + b
  tmp1 += c
  tmp1 /= c
  result = tmp1

 Could this transformation be done in the ast? And would that help?

I don't think it could be done in the ast because I don't think you can
work with anonymous temporaries there. But, now that you mention it, it
could be done on the fly in the implementation of the relevant opcodes.
I.e., BIN_ADD could do

if (Py_REFCNT(left) == 1)
    result = PyNumber_InPlaceAdd(left, right);
else
    result = PyNumber_Add(left, right);

Upside: all packages automagically benefit!

Potential downsides to consider:
- Subtle but real and user-visible change in Python semantics. I'd be a
little nervous about whether anyone has implemented, say, an __iadd__ with side
effects such that you can tell whether a copy was made, even if the object
being copied is immediately destroyed. Maybe this doesn't make sense though.
- Only works when left operand is the temporary (remember that a*b+c is
faster than c+a*b), and only for arithmetic (no benefit for np.sin(a +
b)). Probably does cover the majority of cases though.
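Why a refcount test can detect a dying temporary can be sketched at the Python level with sys.getrefcount (a toy illustration only; the real check would happen in C, inside the opcode, before the operands are DECREFed):

```python
import sys

# sys.getrefcount reports one extra reference for its own argument,
# so the absolute numbers are offset, but the comparison is what matters.
def refs(obj):
    return sys.getrefcount(obj)

named = [1, 2, 3]
count_named = refs(named)        # the 'named' binding adds a reference
count_temp = refs([1, 2, 3])     # a fresh temporary: no other holders

# The temporary shows fewer references; an opcode that sees
# Py_REFCNT(left) == 1 knows nobody else can observe an in-place update.
assert count_temp < count_named
```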

 A prolonged discussion might be better on python-ideas. See what others
say.

Yeah, I wasn't sure which list to use for this one, happy to move if it
would work better.

-n


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Chris Angelico
On Fri, Jun 6, 2014 at 11:47 AM, Nathaniel Smith n...@pobox.com wrote:
 Unfortunately we don't actually know whether Cython is the only culprit
 (such code *could* be written by hand), and even if we fixed Cython it would
 take some unknowable amount of time before all downstream users upgraded
 their Cythons. (It's pretty common for projects to check in Cython-generated
 .c files, and only regenerate when the Cython source actually gets
 modified.) Pretty risky for an optimization.

But the code will still work, right? I mean, you miss out on an
optimization, but it won't actually be wrong code? It should be
possible to say "After upgrading to Cython version x.y, regenerate all
your .c files to take advantage of this new optimization."

ChrisA


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Greg Ewing

Nathaniel Smith wrote:

I.e., BIN_ADD could do

if (Py_REFCNT(left) == 1)
    result = PyNumber_InPlaceAdd(left, right);
else
    result = PyNumber_Add(left, right);

Upside: all packages automagically benefit!

Potential downsides to consider:
- Subtle but real and user-visible change in Python semantics.


That would be a real worry. Even if such cases were rare,
they'd be damnably difficult to debug when they did occur.

I think for safety's sake this should only be done if the
type concerned opts in somehow, perhaps by a tp_flag
indicating that the type is eligible for temporary
elision.

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Sturla Molden

On 05/06/14 22:51, Nathaniel Smith wrote:


This gets evaluated as:

tmp1 = a + b
tmp2 = tmp1 + c
result = tmp2 / c

All these temporaries are very expensive. Suppose that a, b, c are
arrays with N bytes each, and N is large. For simple arithmetic like
this, then costs are dominated by memory access. Allocating an N byte
array requires the kernel to clear the memory, which incurs N bytes of
memory traffic.


It seems to be the case that a large portion of the run-time in Python 
code using NumPy can be spent in the kernel zeroing pages (which the 
kernel does for security reasons).


I think this can also be seen as a 'malloc problem'. It comes about 
because each new NumPy array starts with a fresh buffer allocated by 
malloc. Perhaps buffers can be reused?
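The buffer-reuse idea can be sketched as a free list keyed by size (a hypothetical toy, not NumPy's allocator; the BufferPool class and its method names are invented here):

```python
import collections

class BufferPool:
    """Hand back previously released buffers of the same size instead
    of asking the allocator (and the kernel's page-zeroing) again."""
    def __init__(self):
        self._free = collections.defaultdict(list)

    def acquire(self, nbytes):
        free = self._free[nbytes]
        return free.pop() if free else bytearray(nbytes)

    def release(self, buf):
        self._free[len(buf)].append(buf)

pool = BufferPool()
a = pool.acquire(1024)
pool.release(a)
b = pool.acquire(1024)
assert b is a          # the buffer was recycled, not reallocated
```

A real implementation would need to cap the free list and handle alignment, but the sketch shows where the zeroing cost disappears.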


Sturla








Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Greg Ewing

Nathaniel Smith wrote:

I'd be a
little nervous about whether anyone has implemented, say, an __iadd__ with
side effects such that you can tell whether a copy was made, even if the
object being copied is immediately destroyed.


I can think of at least one plausible scenario where
this could occur: the operand is a view object that
wraps another object, and its __iadd__ method updates
that other object.

In fact, now that I think about it, exactly this
kind of thing happens in numpy when you slice an
array!

So the opt-in indicator would need to be dynamic, on
a per-object basis, rather than a type flag.
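The slicing case is easy to demonstrate: an in-place add on a NumPy view writes through to the base array (minimal sketch, assuming NumPy is installed):

```python
import numpy as np

a = np.arange(5)     # [0, 1, 2, 3, 4]
v = a[1:]            # a view wrapping 'a', not a copy
v += 10              # __iadd__ on the view updates the base array

assert v.base is a                        # still the same buffer
assert a.tolist() == [0, 11, 12, 13, 14]  # the "copy" mutated the original
```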

--
Greg


Re: [Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes

2014-06-05 Thread Nikolaus Rath
Nathaniel Smith n...@pobox.com writes:
  tmp1 = a + b
  tmp1 += c
  tmp1 /= c
  result = tmp1

 Could this transformation be done in the ast? And would that help?

 I don't think it could be done in the ast because I don't think you can
 work with anonymous temporaries there. But, now that you mention it, it
 could be done on the fly in the implementation of the relevant opcodes.
 I.e., BIN_ADD could do

 if (Py_REFCNT(left) == 1)
     result = PyNumber_InPlaceAdd(left, right);
 else
     result = PyNumber_Add(left, right);

 Upside: all packages automagically benefit!

 Potential downsides to consider:
 - Subtle but real and user-visible change in Python semantics. I'd be a
 little nervous about whether anyone has implemented, say, an __iadd__ with
 side effects such that you can tell whether a copy was made, even if the
 object being copied is immediately destroyed. Maybe this doesn't make sense
 though.

Hmm. I don't think this is as unlikely as it may sound. Consider e.g. the
h5py module:

with h5py.File('database.h5') as fh:
    result = fh['key'] + np.ones(42)

if this were transformed to

with h5py.File('database.h5') as fh:
    tmp = fh['key']
    tmp += np.ones(42)
    result = tmp

then the database.h5 file would get modified, *and* result would be of
type h5py.Dataset rather than np.array.
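The same hazard can be shown without h5py, using a toy wrapper (the Dataset class below is invented for illustration) whose __iadd__ writes through to a backing store:

```python
# A dict stands in for the on-disk file.
class Dataset:
    def __init__(self, store, key):
        self.store, self.key = store, key
    def __add__(self, other):
        return self.store[self.key] + other   # returns a plain int
    def __iadd__(self, other):
        self.store[self.key] += other         # mutates the "file"!
        return self

store = {'key': 100}

# Out-of-place: result is an int, the store is untouched.
result = Dataset(store, 'key') + 42
assert result == 142 and store['key'] == 100

# If the interpreter silently rewrote that '+' into '+=' on the
# refcount-1 temporary, the backing store would change and the
# result would be a Dataset, not an int:
tmp = Dataset(store, 'key')
tmp += 42
assert isinstance(tmp, Dataset) and store['key'] == 142
```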


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Greg Ewing

Paul Sokolovsky wrote:

All these changes are what let me dream on and speculate on
possibility that Python4 could offer an encoding-neutral string type
(which means based on bytes)


Can you elaborate on exactly what you have in mind?
You seem to want something different from Python 3 str,
Python 3 bytes and Python 2 str, but it's far from
clear what you want this type to be like.

--
Greg


[Python-Dev] Internal representation of strings and Micropython (Steven D'Aprano's summary)

2014-06-05 Thread Jim J. Jewett


Steven D'Aprano wrote:

 (1) I asked if it would be okay for MicroPython to *optionally* use 
 nominally Unicode strings limited to ASCII. Pretty much the only 
 response to this has been Guido saying "That would be a pretty lousy 
 option", and since nobody has really defended the suggestion, I think we 
 can assume that it's off the table.

Lousy is not quite the same as forbidden.

Doing it in good faith would require making the limit prominent
in the documentation, and raising some sort of CharacterNotSupported
exception (or at least a warning) whenever there is an attempt to
create a non-ASCII string, even via the C API.

 (2) I asked if it would be okay ... to use a UTF-8 implementation 
 even though it would lead to O(N) indexing operations instead of O(1). 
 There's been some opposition to this, including Guido's:

[Non-ASCII character removed.]

It is bad when quirks -- even good quirks -- of one implementation lead
people to write code that will perform badly on a different Python
implementation.  Cpython has at least delayed obvious optimizations for
this reason.  Changing idiomatic operations from O(1) to O(N) is big
enough to cause a concern.

That said, the target environment itself apparently limits N to small
enough that the problem should be mostly theoretical.  If you want to
be good citizens, then do put a note in the documentation warning that
particularly long strings are likely to cause performance issues unique
to the MicroPython implementation.

(Frankly, my personal opinion is that if you're really optimizing for
space, then long strings will start getting awkward long before N is
big enough for algorithmic complexity to overcome constant factors.)

 ... those strings will need to be transcoded to UTF-8 before they
 can be written or printed, so keeping them as UTF-8 ...

That all assumes that the external world is using UTF-8 anyhow.

Which is more likely to be true if you document it as a limitation
of MicroPython.

 ... but many strings may never be written out:

print(prefix + s[1:].strip().lower().center(80) + suffix)

 creates five strings that are never written out and one that is.

But looking at the actual strings -- UTF-8 doesn't really hurt
much.  Only the slice and center() are more complex, and for a
string less than 80 characters long, O(N) is irrelevant.
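For concreteness, here is a minimal sketch of what O(N) codepoint indexing into a UTF-8 buffer looks like (utf8_index is a hypothetical helper, not MicroPython's actual implementation): every lead byte from the start must be walked to find the i-th character.

```python
def utf8_index(buf, i):
    """Return the i-th codepoint of UTF-8 bytes 'buf' in O(i) time."""
    pos = 0
    for _ in range(i):
        pos += 1
        # skip continuation bytes (0b10xxxxxx) of the current character
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
    end = pos + 1
    while end < len(buf) and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[pos:end].decode('utf-8')

b = 'naïve café'.encode('utf-8')
assert utf8_index(b, 2) == 'ï'   # two-byte character found by scanning
assert utf8_index(b, 8) == 'f'   # cost grows with the index
```

For the short strings typical of an embedded target, that linear scan is cheap; it is only algorithmically worse than CPython's O(1) array lookup.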

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ
