Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Martin v. Löwis
 None of the functions in this PEP become part of the stable ABI.
 
 I think that's only part of the truth. This PEP can potentially have an
 impact on the stable ABI in the sense that the build-time size of
 Py_UNICODE may no longer be important for extensions that work on
 unicode buffers in the future as long as they only use the 'str' pointer
 and not 'wstr'.

Py_UNICODE isn't part of the stable ABI, so it wasn't important for
extensions using the stable ABI before - so really no change here.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Stefan Behnel

Martin v. Löwis, 29.01.2011 10:05:

None of the functions in this PEP become part of the stable ABI.


I think that's only part of the truth. This PEP can potentially have an
impact on the stable ABI in the sense that the build-time size of
Py_UNICODE may no longer be important for extensions that work on
unicode buffers in the future as long as they only use the 'str' pointer
and not 'wstr'.


Py_UNICODE isn't part of the stable ABI, so it wasn't important for
extensions using the stable ABI before - so really no change here.


I know, that's not what I meant. But this PEP would enable a C API that 
provides direct access to the underlying buffer. Just as is currently 
provided for the Py_UNICODE array, but with a stable ABI because the buffer 
type won't change based on build time options.
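Something along these lines, for example (the accessor names here are made
up for illustration - Martin sketches a very similar pair later in this
thread - and are not an existing stable-ABI API):

  int kind = PyUnicode_Kind(s);      /* 1, 2 or 4 bytes per code point */
  void *data = PyUnicode_Data(s);    /* layout fixed by the string itself,
                                        not by the build-time Py_UNICODE size */
  Py_ssize_t n = PyUnicode_Size(s);  /* number of code points */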


OTOH, one could argue that this is already partly provided by the generic 
buffer API.


Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Nick Coghlan
On Sat, Jan 29, 2011 at 8:00 PM, Stefan Behnel stefan...@behnel.de wrote:
 OTOH, one could argue that this is already partly provided by the generic
 buffer API.

Which won't be part of the stable ABI until 3.3 - there are some
discrepancies between PEP 3118 and the actual implementation that we
need to sort out first.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Antoine Pitrou
On Sat, 29 Jan 2011 11:00:48 +0100
Stefan Behnel stefan...@behnel.de wrote:
 
 I know, that's not what I meant. But this PEP would enable a C API that 
 provides direct access to the underlying buffer. Just as is currently 
 provided for the Py_UNICODE array, but with a stable ABI because the buffer 
 type won't change based on build time options.
 
 OTOH, one could argue that this is already partly provided by the generic 
 buffer API.

Unicode objects don't provide the buffer API (and chances are they never
will).

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Stefan Behnel

Martin v. Löwis, 24.01.2011 21:17:

I'd like to propose PEP 393, which takes a different approach,
addressing both problems simultaneously: by getting a flexible
representation (one that can be either 1, 2, or 4 bytes), we can
support the full range of Unicode on all systems, but still use
only one byte per character for strings that are pure ASCII (which
will be the majority of strings for the majority of users).

You'll find the PEP at

http://www.python.org/dev/peps/pep-0393/
[...]
The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.


What about the character property functions?

http://docs.python.org/py3k/c-api/unicode.html#unicode-character-properties

Will they be adapted to accept Py_UCS4 instead of Py_UNICODE?

Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Alexander Belopolsky
On Sat, Jan 29, 2011 at 12:03 PM, Stefan Behnel stefan...@behnel.de wrote:
..
 What about the character property functions?

 http://docs.python.org/py3k/c-api/unicode.html#unicode-character-properties

 Will they be adapted to accept Py_UCS4 instead of Py_UNICODE?

They have been already.  See revision 84177.  Docs should be fixed.


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Florian Weimer
* Stefan Behnel:

 Martin v. Löwis, 24.01.2011 21:17:
 The Py_UNICODE type is still supported but deprecated. It is always
 defined as a typedef for wchar_t, so the wstr representation can double
 as Py_UNICODE representation.

 It's too bad this isn't initialised by default, though. Py_UNICODE is
 the only representation that can be used efficiently from C code

Is this really true?  I don't think I've seen any C API which actually
uses wchar_t, beyond what is provided by libc.  UTF-8 and even
UTF-16 are much, much more common.

-- 
Florian Weimer                fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel

Florian Weimer, 28.01.2011 10:35:

* Stefan Behnel:

Martin v. Löwis, 24.01.2011 21:17:

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.


It's too bad this isn't initialised by default, though. Py_UNICODE is
the only representation that can be used efficiently from C code


Is this really true?  I don't think I've seen any C API which actually
uses wchar_t, beyond what is provided by libc.  UTF-8 and even
UTF-16 are much, much more common.


They are also much harder to use, unless you are really only interested in 
7-bit ASCII data - which is the case for most C libraries, so I believe 
that's what you meant here. However, this is the CPython runtime with 
built-in Unicode support, not the C runtime, where Unicode comes as an 
add-on at best and where it is common to process Unicode data without 
being Unicode aware.


The nice thing about Py_UNICODE is that it basically gives you native 
Unicode code points directly, without needing to decode UTF-8 byte runs and 
the like. In Cython, it allows you to do things like this:


def test_for_those_characters(unicode s):
    for c in s:
        # warning: randomly chosen Unicode escapes ahead
        if c in u"\u0356\u1012\u3359\u4567":
            return True
    else:
        return False

The loop runs in plain C, using the somewhat obvious implementation with a 
loop over Py_UNICODE characters and a switch statement for the comparison. 
This would look a *lot* more ugly with UTF-8 encoded byte strings.
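Roughly, the generated C looks something like this (a simplified sketch,
not Cython's literal output; PyUnicode_AS_UNICODE and PyUnicode_GET_SIZE
are the existing CPython macros):

  static int test_for_those_characters(PyObject *s)
  {
      Py_UNICODE *buf = PyUnicode_AS_UNICODE(s);  /* build-time sized units */
      Py_ssize_t i, n = PyUnicode_GET_SIZE(s);
      for (i = 0; i < n; i++) {
          switch (buf[i]) {
              case 0x0356: case 0x1012: case 0x3359: case 0x4567:
                  return 1;   /* found one of the characters */
              default:
                  break;
          }
      }
      return 0;   /* none of them present */
  }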


Regarding Cython specifically, the above will still be *possible* under the 
proposal, given that the memory layout of the strings will still represent 
the Unicode code points. It will just be trickier to implement in Cython's 
type system as there is no longer a (user visible) C type representation 
for those code units. It can be any of uchar, ushort16 or uint32, none 
of which is necessarily a 'native' representation of a Unicode character in 
CPython. While I'm somewhat confident that I'll find a way to fix this in 
Cython, my point is just that this adds a certain level of complexity to C 
code using the new memory layout that simply wasn't there before.


Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Florian Weimer
* Stefan Behnel:

 The nice thing about Py_UNICODE is that it basically gives you native
 Unicode code points directly, without needing to decode UTF-8 byte
 runs and the like. In Cython, it allows you to do things like this:

 def test_for_those_characters(unicode s):
     for c in s:
         # warning: randomly chosen Unicode escapes ahead
         if c in u"\u0356\u1012\u3359\u4567":
             return True
     else:
         return False

 The loop runs in plain C, using the somewhat obvious implementation
 with a loop over Py_UNICODE characters and a switch statement for the
 comparison. This would look a *lot* more ugly with UTF-8 encoded byte
 strings.

Not really, because UTF-8 is quite search-friendly.  (The if would
have to invoke a memmem()-like primitive.)  Random subscripts are
problematic.

However, why would one want to write loops like the above?  Don't you
have to take combining characters (comprising multiple codepoints)
into account most of the time when you look at individual characters?
Then UTF-32 does not offer much of a simplification.
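To make the idea slightly more concrete, a minimal sketch, assuming the
haystack is already UTF-8 encoded (memmem() is a GNU/BSD extension rather
than ISO C, and the needle bytes below are simply the UTF-8 encodings of
the four example code points):

  #define _GNU_SOURCE
  #include <string.h>

  static int contains_any(const char *utf8, size_t len)
  {
      /* UTF-8 encodings of U+0356, U+1012, U+3359, U+4567 */
      static const char *const needles[] = {
          "\xCD\x96", "\xE1\x80\x92", "\xE3\x8D\x99", "\xE4\x95\xA7"
      };
      size_t i;
      for (i = 0; i < sizeof(needles) / sizeof(needles[0]); i++)
          if (memmem(utf8, len, needles[i], strlen(needles[i])) != NULL)
              return 1;   /* a substring match is a code point match:
                             UTF-8 is self-synchronizing, so no false hits */
      return 0;
  }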

-- 
Florian Weimer                fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel

Florian Weimer, 28.01.2011 15:27:

* Stefan Behnel:


The nice thing about Py_UNICODE is that it basically gives you native
Unicode code points directly, without needing to decode UTF-8 byte
runs and the like. In Cython, it allows you to do things like this:

 def test_for_those_characters(unicode s):
     for c in s:
         # warning: randomly chosen Unicode escapes ahead
         if c in u"\u0356\u1012\u3359\u4567":
             return True
     else:
         return False

The loop runs in plain C, using the somewhat obvious implementation
with a loop over Py_UNICODE characters and a switch statement for the
comparison. This would look a *lot* more ugly with UTF-8 encoded byte
strings.


Not really, because UTF-8 is quite search-friendly.  (The if would
have to invoke a memmem()-like primitive.)  Random subscripts are
problematic.

However, why would one want to write loops like the above?  Don't you
have to take combining characters (comprising multiple codepoints)
into account most of the time when you look at individual characters?
Then UTF-32 does not offer much of a simplification.


Hmm, I think this discussion is pointless. Regardless of the memory layout, 
you can always go down to the byte level and use an efficient 
(multi-)substring search algorithm (which is obviously helped if you know 
the layout at compile time *wink*).


Bad example, I guess.

Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Martin v. Löwis
 The nice thing about Py_UNICODE is that it basically gives you native
 Unicode code points directly, without needing to decode UTF-8 byte runs
 and the like. In Cython, it allows you to do things like this:
 
 def test_for_those_characters(unicode s):
     for c in s:
         # warning: randomly chosen Unicode escapes ahead
         if c in u"\u0356\u1012\u3359\u4567":
             return True
     else:
         return False
 
 The loop runs in plain C, using the somewhat obvious implementation with
 a loop over Py_UNICODE characters and a switch statement for the
 comparison. This would look a *lot* more ugly with UTF-8 encoded byte
 strings.

And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8
representation for such a loop. Instead, it should access the str
representation, and might compile this to code like

#define Cython_CharAt(data, kind, pos) (kind == LATIN1 ? \
    ((unsigned char*)data)[pos] : kind == UCS2 ? \
    ((unsigned short*)data)[pos] : \
    ((Py_UCS4*)data)[pos])

 void *data = PyUnicode_Data(s);
 int kind = PyUnicode_Kind(s);
 for (int pos = 0; pos < PyUnicode_Size(s); pos++) {
     Py_UCS4 c = Cython_CharAt(data, kind, pos);
     Py_UCS4 tmp[] = {0x356, 0x1012, 0x3359, 0x4567};
     for (int k = 0; k < 4; k++)
         if (c == tmp[k])
             return 1;
 }
 return 0;

 Regarding Cython specifically, the above will still be *possible* under
 the proposal, given that the memory layout of the strings will still
 represent the Unicode code points. It will just be trickier to implement
 in Cython's type system as there is no longer a (user visible) C type
 representation for those code units.

There is: Py_UCS4 remains available.

 It can be any of uchar, ushort16 or
 uint32, none of which is necessarily a 'native' representation of a
 Unicode character in CPython.

There won't be a native representation anymore - that's the whole
point of the PEP.

 While I'm somewhat confident that I'll
 find a way to fix this in Cython, my point is just that this adds a
 certain level of complexity to C code using the new memory layout that
 simply wasn't there before.

Understood. However, I think it is easier than you think it is.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Josiah Carlson
Pardon me for this drive-by posting, but this thread smells a lot like this
old thread (don't be afraid to read it all, there are some good points in
there; not directed at you Martin, but at all readers/posters in this
thread)...

http://mail.python.org/pipermail/python-3000/2006-September/003795.html

I'm not averse to faster and/or more memory efficient unicode representations
(I would be quite happy with them, actually). I do see the usefulness of having
non-utf-8 representations, and caching them is a good idea, though I wonder
whether that is good for Python itself to cache, or good for the application
to cache.

The evil side of me says that we should just provide an API available in
Python/C for "give me the representation of unicode string X using the
2byte/4byte code points", and have it just return the appropriate
array.array() value (useful for passing to other APIs, or for those who need
to do manual manipulation of code-points), or whatever structure is deemed
to be appropriate.

The less evil side of me says that going with what the PEP offers isn't a
bad idea, and might just be a good idea.

I'll defer my vote to Martin.

Regards,
 - Josiah

On Mon, Jan 24, 2011 at 12:17 PM, Martin v. Löwis mar...@v.loewis.de wrote:

 I have been thinking about Unicode representation for some time now.
 This was triggered, on the one hand, by discussions with Glyph Lefkowitz
 (who complained that his server app consumes too much memory), and Carl
 Friedrich Bolz (who profiled Python applications to determine that
 Unicode strings are among the top consumers of memory in Python).
 On the other hand, this was triggered by the discussion on supporting
 surrogates in the library better.

 I'd like to propose PEP 393, which takes a different approach,
 addressing both problems simultaneously: by getting a flexible
 representation (one that can be either 1, 2, or 4 bytes), we can
 support the full range of Unicode on all systems, but still use
 only one byte per character for strings that are pure ASCII (which
 will be the majority of strings for the majority of users).

 You'll find the PEP at

 http://www.python.org/dev/peps/pep-0393/

 For convenience, I include it below.

 Regards,
 Martin

 PEP: 393
 Title: Flexible String Representation
 Version: $Revision: 88168 $
 Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $
 Author: Martin v. Löwis mar...@v.loewis.de
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 24-Jan-2010
 Python-Version: 3.3
 Post-History:

 Abstract
 ========

 The Unicode string type is changed to support multiple internal
 representations, depending on the character with the largest Unicode
 ordinal (1, 2, or 4 bytes). This will allow a space-efficient
 representation in common cases, but give access to full UCS-4 on all
 systems. For compatibility with existing APIs, several representations
 may exist in parallel; over time, this compatibility should be phased
 out.

 Rationale
 =========

 There are two classes of complaints about the current implementation
 of the unicode type: on systems only supporting UTF-16, users complain
 that non-BMP characters are not properly supported. On systems using
 UCS-4 internally (and also sometimes on systems using UCS-2), there is
 a complaint that Unicode strings take up too much memory - especially
 compared to Python 2.x, where the same code would often use ASCII
 strings (i.e. ASCII-encoded byte strings). With the proposed approach,
 ASCII-only Unicode strings will again use only one byte per character;
 while still allowing efficient indexing of strings containing non-BMP
 characters (as strings containing them will use 4 bytes per
 character).

 One problem with the approach is support for existing applications
 (e.g. extension modules). For compatibility, redundant representations
 may be computed. Applications are encouraged to phase out reliance on
 a specific internal representation if possible. As interaction with
 other libraries will often require some sort of internal
 representation, the specification chooses UTF-8 as the recommended way
 of exposing strings to C code.

 For many strings (e.g. ASCII), multiple representations may actually
 share memory (e.g. the shortest form may be shared with the UTF-8 form
 if all characters are ASCII). With such sharing, the overhead of
 compatibility representations is reduced.

 Specification
 =============

 The Unicode object structure is changed to this definition::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    void *str;
    Py_hash_t hash;
    int state;
    Py_ssize_t utf8_length;
    void *utf8;
    Py_ssize_t wstr_length;
    void *wstr;
  } PyUnicodeObject;

 These fields have the following interpretations:

 - length: number of code points in the string (result of sq_length)
 - str: shortest-form representation of the unicode string; the lower
  two bits of the pointer indicate the specific form
 [...]

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel

Martin v. Löwis, 28.01.2011 22:49:

And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8
representation for such a loop. Instead, it should access the str
representation


Sure.



Regarding Cython specifically, the above will still be *possible* under
the proposal, given that the memory layout of the strings will still
represent the Unicode code points. It will just be trickier to implement
in Cython's type system as there is no longer a (user visible) C type
representation for those code units.


There is: Py_UCS4 remains available.


Thanks for that pointer. I had always thought that all *UCS4* names were 
platform specific and had completely missed that type. It's a lot nicer 
than Py_UNICODE because it allows users to fold surrogate pairs back into 
the character value.


It's completely missing from the docs, BTW. Google doesn't give me a single 
mention on all of docs.python.org, even though it existed at least since 
(and likely long before) Cython's oldest supported runtime Python 2.3.


If I had known about that type earlier, I could have ended up making that 
the native Unicode character type in Cython instead of bothering with 
Py_UNICODE. But this can still be changed I think. Since type inference was 
available before native Py_UNICODE support, it's unlikely that users will 
have Py_UNICODE written in their code explicitly. So I can make the switch 
under the hood.


Just to explain, a native CPython C type is much better than an arbitrary 
integer type, because it allows Cython to apply specific coercion rules 
from and to Python object types. Like Py_UNICODE currently does, Py_UCS4 would 
obviously coerce from and to a 1 character Unicode string, but it could 
additionally handle surrogate pair splitting and combining automatically on 
current 16-bit Unicode builds so that you'd get a Unicode string with two 
code points on coercion to Python.
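The folding/splitting itself is just the usual UTF-16 arithmetic; something
like this (a sketch with made-up helper names, not CPython API):

  /* combine a surrogate pair from a 16-bit build into one code point */
  static Py_UCS4 fold_surrogates(Py_UNICODE high, Py_UNICODE low)
  {
      return 0x10000 + (((Py_UCS4)high - 0xD800) << 10)
                     + ((Py_UCS4)low - 0xDC00);
  }

  /* split a non-BMP code point back into a surrogate pair */
  static void split_surrogates(Py_UCS4 ch, Py_UNICODE *high, Py_UNICODE *low)
  {
      ch -= 0x10000;
      *high = (Py_UNICODE)(0xD800 + (ch >> 10));
      *low  = (Py_UNICODE)(0xDC00 + (ch & 0x3FF));
  }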




While I'm somewhat confident that I'll
find a way to fix this in Cython, my point is just that this adds a
certain level of complexity to C code using the new memory layout that
simply wasn't there before.


Understood. However, I think it is easier than you think it is.


Let's see about the implications once there is an implementation.

Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel

Martin v. Löwis, 24.01.2011 21:17:

I have been thinking about Unicode representation for some time now.
This was triggered, on the one hand, by discussions with Glyph Lefkowitz
(who complained that his server app consumes too much memory), and Carl
Friedrich Bolz (who profiled Python applications to determine that
Unicode strings are among the top consumers of memory in Python).
On the other hand, this was triggered by the discussion on supporting
surrogates in the library better.

I'd like to propose PEP 393, which takes a different approach,
addressing both problems simultaneously: by getting a flexible
representation (one that can be either 1, 2, or 4 bytes), we can
support the full range of Unicode on all systems, but still use
only one byte per character for strings that are pure ASCII (which
will be the majority of strings for the majority of users).

You'll find the PEP at

http://www.python.org/dev/peps/pep-0393/

[...]
Stable ABI
----------

None of the functions in this PEP become part of the stable ABI.


I think that's only part of the truth. This PEP can potentially have an 
impact on the stable ABI in the sense that the build-time size of 
Py_UNICODE may no longer be important for extensions that work on unicode 
buffers in the future as long as they only use the 'str' pointer and not 
'wstr'.


Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel

Martin v. Löwis, 24.01.2011 21:17:

I have been thinking about Unicode representation for some time now.
This was triggered, on the one hand, by discussions with Glyph Lefkowitz
(who complained that his server app consumes too much memory), and Carl
Friedrich Bolz (who profiled Python applications to determine that
Unicode strings are among the top consumers of memory in Python).
On the other hand, this was triggered by the discussion on supporting
surrogates in the library better.

I'd like to propose PEP 393, which takes a different approach,
addressing both problems simultaneously: by getting a flexible
representation (one that can be either 1, 2, or 4 bytes), we can
support the full range of Unicode on all systems, but still use
only one byte per character for strings that are pure ASCII (which
will be the majority of strings for the majority of users).

You'll find the PEP at

http://www.python.org/dev/peps/pep-0393/


After much discussion, I'm +1 for this PEP. Implementation and benchmarks 
are pending, but there are strong indicators that it will bring relief for 
the memory overhead of most applications without leading to a major 
degradation performance-wise. Not for Python code anyway, and I'll try to 
make sure Cython extensions won't notice much when switching to CPython 3.3.


Martin, this is a smart way of doing it.

Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Antoine Pitrou
On Wednesday, 26 January 2011 at 21:50 -0800, Gregory P. Smith wrote:
 
  Incidentally, to slightly reduce the overhead of unicode objects,
  there's this proposal: http://bugs.python.org/issue1943
 
 Interesting.  But that aims more at cpu performance than memory
 overhead.  What I see is programs that predominantly process ascii
 data yet waste memory on a 2-4x data explosion of the internal
 representation.  This PEP aims to address that larger target.

Right, but we should keep in mind that many unicode strings will not be
very large, and so the constant overhead of unicode objects is not
necessarily negligible.

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel

Martin v. Löwis, 24.01.2011 21:17:

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.


It's too bad this isn't initialised by default, though. Py_UNICODE is the 
only representation that can be used efficiently from C code and Cython 
relies on it for fast text processing. This proposal will therefore likely 
have a pretty negative performance impact on extensions written in Cython 
as the compiler could no longer expect this representation to be available 
instantaneously.


Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread James Y Knight
On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:
 Martin v. Löwis, 24.01.2011 21:17:
 The Py_UNICODE type is still supported but deprecated. It is always
 defined as a typedef for wchar_t, so the wstr representation can double
 as Py_UNICODE representation.
 
 It's too bad this isn't initialised by default, though. Py_UNICODE is the 
 only representation that can be used efficiently from C code and Cython 
 relies on it for fast text processing. This proposal will therefore likely 
 have a pretty negative performance impact on extensions written in Cython as 
 the compiler could no longer expect this representation to be available 
 instantaneously.

But the whole point of the exercise is so that it doesn't have to store a 
4byte-per-char representation when a 1byte-per-char rep would do. If cython 
wants to work most efficiently with this proposal, it should learn to deal with 
the three possible raw representations.

James


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
 So, the only criticism I have, intuitively, is that the unicode
 structure seems to become a bit too large. For example, I'm not sure you
 need a generic (pointer, size) pair in addition to the
 representation-specific ones.

It's not really a generic pointer, but rather a variable-sized pointer.
It may not fit into any of the other representations (e.g. if there is
a four-byte wchar_t, then a two-byte representation would fit neither
into the UTF-8 pointer nor into the wchar_t pointer).

  Incidentally, to slightly reduce the overhead of unicode objects,
 there's this proposal: http://bugs.python.org/issue1943

I wonder what aspects of this patch and discussion should be integrated
into the PEP. The notion of allocating the memory in the same block is
already considered in the PEP; what else might be relevant?
Input is welcome!

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
 I believe the intent this pep is aiming at is for the existing in
 memory structure to be compatible with already compiled binary
 extension modules without having to recompile them or change the APIs
 they are using.

No, binary compatibility is not achieved. ABI-conforming modules will
continue to work even under this change, but only because access to the
unicode object internal representation is not available to the
restricted ABI.

 Personally I don't care at all about preserving that level of binary
 compatibility, it has been convenient in the past but is rarely the
 right thing to do.  Of course I'd personally like to see PyObject
 nuked and revisited, it is too large and is probably not cache line
 efficient.

That's a different PEP :-)

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Glenn Linderman

On 1/27/2011 12:26 PM, James Y Knight wrote:

On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:

Martin v. Löwis, 24.01.2011 21:17:

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.

It's too bad this isn't initialised by default, though. Py_UNICODE is the only 
representation that can be used efficiently from C code and Cython relies on it 
for fast text processing. This proposal will therefore likely have a pretty 
negative performance impact on extensions written in Cython as the compiler 
could no longer expect this representation to be available instantaneously.

But the whole point of the exercise is so that it doesn't have to store a 
4byte-per-char representation when a 1byte-per-char rep would do. If cython 
wants to work most efficiently with this proposal, it should learn to deal with 
the three possible raw representations.


C was doing fast text processing on char long before Py_UNICODE or 
wchar_t existed.


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
 Repetition of 11; I'm guessing that the 2byte/UCS-2 should read 10,
 so that they give the width of the char representation.

Thanks, fixed.

   00 = null pointer
 
 Naturally this assumes that all pointers are at least 4-byte aligned (so
 that they can be masked off).  I assume that this is sane on every
 platform that Python supports, but should it be spelled out explicitly
 somewhere in the PEP?

I'll change the PEP to move the type indicator into the state field, so
that issue becomes irrelevant.

   The string is null-terminated (in its respective representation).
 - hash, state: same as in Python 3.2
 - utf8_length, utf8: UTF-8 representation (null-terminated)
 If this is to share its buffer with the str representation for the
 Latin-1 case, then I take it this ptr will typically be (str & ~4) ?
 i.e. only str has the low-order-bit type info.

Yes, the other pointers are aligned. Notice that the case in which
sharing occurs is only ASCII, though (for Latin-1, some characters
require two bytes in UTF-8).
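A concrete pair of byte sequences (example chosen here for illustration,
not from the PEP): U+00E9 LATIN SMALL LETTER E WITH ACUTE is one byte in
the Latin-1 str data but two bytes in UTF-8, so those buffers cannot
alias; for pure ASCII both sequences are byte-for-byte identical.

  unsigned char latin1_data[] = { 0xE9 };        /* 1 byte in str   */
  unsigned char utf8_data[]   = { 0xC3, 0xA9 };  /* 2 bytes in utf8 */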

 Spelling out the meaning of "optional":
   does this mean that the relevant ptr is NULL; if so, if utf8 is null,
 is utf8_length undefined, or is it some dummy value?

I've clarified this: I propose length is undefined (unless there is a
good reason to clear it).

 If the string is created directly with the canonical representation
 (see below), this representation doesn't take a separate memory block,
 but is allocated right after the PyUnicodeObject struct.
 
 Is the idea to do pointer arithmentic when deleting the PyUnicodeObject
 to determine if the ptr is in that location, and not delete it if it is,
 or is there some other way of determining whether the pointers need
 deallocating?

Correct.

 If the former, is this embedding an assumption that the
 underlying allocator couldn't have allocated a buffer directly adjacent
 to the PyUnicodeObject.  I know that GNU libc's malloc/free
 implementation has gaps of two machine words between each allocation;
 off the top of my head I'm not sure if the optimized Object/obmalloc.c
 allocator enforces such gaps.

No, it doesn't... So I guess I reserve another bit in the state for that.
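Deallocation could then test that bit instead of doing pointer
arithmetic - roughly (SSTATE_DATA_INLINE is a made-up name for that
reserved bit, not something from the PEP):

  /* in the deallocator: str either lives in its own block or inline
     right after the object; only the former must be freed explicitly */
  if (!(unicode->state & SSTATE_DATA_INLINE) && unicode->str != NULL)
      PyObject_Free(unicode->str);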

 GDB Debugging Hooks
 ---
 Tools/gdb/libpython.py contains debugging hooks that embed knowledge
 about the internals of CPython's data types, including PyUnicodeObject
 instances.  It will need to be slightly updated to track the change.

Thanks, added.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel

James Y Knight, 27.01.2011 21:26:

On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote:

Martin v. Löwis, 24.01.2011 21:17:

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can
double as Py_UNICODE representation.


It's too bad this isn't initialised by default, though. Py_UNICODE is
the only representation that can be used efficiently from C code and
Cython relies on it for fast text processing. This proposal will
therefore likely have a pretty negative performance impact on
extensions written in Cython as the compiler could no longer expect
this representation to be available instantaneously.


But the whole point of the exercise is so that it doesn't have to store
a 4byte-per-char representation when a 1byte-per-char rep would do.


I am well aware of that. But I'm arguing that the current simpler internal 
representation has had its advantages for CPython as a platform.




If cython wants to work most efficiently with this proposal, it should
learn to deal with the three possible raw representations.


I agree. After all, CPython is lucky to have it available. It wouldn't be 
the first time that we duplicate looping code based on the input type. 
However, like the looping code, it will also complicate all indexing code 
at runtime as it always needs to test which of the representations is 
current before it can read a character. Currently, all of this is a compile 
time decision. This will necessarily have a performance impact.


Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
Am 25.01.2011 12:08, schrieb Nick Coghlan:
 On Tue, Jan 25, 2011 at 6:17 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 A new function PyUnicode_AsUTF8 is provided to access the UTF-8
 representation. It is thus identical to the existing
 _PyUnicode_AsString, which is removed. The function will compute the
 utf8 representation when first called. Since this representation will
 consume memory until the string object is released, applications
 should use the existing PyUnicode_AsUTF8String where possible
 (which generates a new string object every time). API that implicitly
 converts a string to a char* (such as the ParseTuple functions) will
 use this function to compute a conversion.
 
 I'm not entirely clear as to what this function is referring to here.

PyUnicode_AsUTF8 (i.e. the one where you don't need to release the
memory). I made this explicit now.
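In usage terms the difference is roughly this (a sketch based on the PEP
text quoted above; PyUnicode_AsUTF8 is the proposed function,
PyUnicode_AsUTF8String already exists today):

  /* cached: pointer stays valid as long as obj does, nothing to free,
     but the UTF-8 copy occupies memory until obj is released */
  const char *p = PyUnicode_AsUTF8(obj);

  /* uncached: a fresh bytes object that the caller releases */
  PyObject *b = PyUnicode_AsUTF8String(obj);
  if (b != NULL) {
      const char *q = PyBytes_AS_STRING(b);
      /* ... use q ... */
      Py_DECREF(b);
  }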

 I'm also dubious of the PyUnicode_Finalize name - PyUnicode_Ready
 might be a better option (PyType_Ready seems a better analogy for a
 "I've filled everything in, please calculate the derived fields now"
 than Py_Finalize).

Ok, changed (when I was pondering this PEP, this once occurred
to me also, but I forgot it when I typed it in).

 
 More generally, let me see if I understand the proposed structure correctly:
 
 str: Always set once PyUnicode_Ready() has been called.
   Always points to the canonical representation of the string (as
 indicated by PyUnicode_Kind)
 length: Always set once PyUnicode_Ready() has been called. Specifies
 the number of code points in the string.

Correct.

 wstr: Set only if PyUnicode_AsUnicode has been called on the string.

Might also be set when the string is created through
PyUnicode_FromUnicode was used, and PyUnicode_Ready hasn't been called.

 If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
 or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr
 = str, otherwise wstr points to dedicated memory
 wstr_length: Valid only if wstr != NULL
 If wstr_length != length, indicates presence of surrogate pairs in
 a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
 PyUnicode_4BYTE).

Correct.

 utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
 If string contents are pure ASCII, utf8 = str, otherwise utf8
 points to dedicated memory.
 utf8_length: Valid only if utf8_ptr != NULL

Correct.

 One change I would propose is that rather than hiding flags in the low
 order bits of the str pointer, we expand the use of the existing
 state field to cover the representation information in addition to
 the interning information.

Thanks for the idea; done.

 I would also suggest explicitly flagging
 internally whether or not a 1 byte string is ASCII or Latin-1 along
 the lines of:

Not sure about that. It would complicate PyUnicode_Kind.

Instead, I'd rather fill out utf8 right away if we can use sharing
(e.g. when the string is created with a max value < 128, or
PyUnicode_Ready has determined that).

So I keep it for the moment as reserved (but would use it when
str is NULL, as I'd have to fill in some value, anyway).
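In code, the sharing would amount to something like this (a sketch; field
names as in the PEP's struct, maxchar being whatever the constructor
determined):

  if (maxchar < 128) {
      /* pure ASCII: the 1-byte str data is already valid UTF-8 */
      unicode->utf8 = unicode->str;
      unicode->utf8_length = unicode->length;
  }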

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
From my first impression, I'm
 not too thrilled by the prospect of making the Unicode implementation
 more complicated by having three different representations on each
 object.

Thanks, added as a concern.

 I also don't see how this could save a lot of memory. As an example
 take a French text with say 10mio code points. This would end up
 appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
 one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
 on how many accents are used). That's a saving of -10MB compared to
 today's implementation :-)

As others have pointed out: that's not how it works. It actually *will*
save memory, since the alternative representations are optional.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
Am 27.01.2011 20:06, schrieb Stefan Behnel:
 Martin v. Löwis, 24.01.2011 21:17:
 The Py_UNICODE type is still supported but deprecated. It is always
 defined as a typedef for wchar_t, so the wstr representation can double
 as Py_UNICODE representation.
 
 It's too bad this isn't initialised by default, though. Py_UNICODE is
 the only representation that can be used efficiently from C code and
 Cython relies on it for fast text processing.

That's not true. The str representation can also be used efficiently from C.

 This proposal will
 therefore likely have a pretty negative performance impact on extensions
 written in Cython as the compiler could no longer expect this
 representation to be available instantaneously.

In any case, I've added this concern.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
 I agree. After all, CPython is lucky to have it available. It wouldn't
 be the first time that we duplicate looping code based on the input
 type. However, like the looping code, it will also complicate all
 indexing code at runtime as it always needs to test which of the
 representations is current before it can read a character. Currently,
 all of this is a compile time decision. This will necessarily have a
 performance impact.

That's most certainly the case. That's one of the reasons to discuss
this through a PEP, rather than just coming up with a patch: if people
object to it too much because of the impact on execution speed, it may
get rejected. Of course, that would make those unhappy who complain
about the memory consumption.

This is a classic time-space tradeoff, favoring space reduction
over time reduction.

I fully understand that the actual impact can only be observed when
an implementation is available, and applications have made a reasonable
effort to work with the implementation efficiently (or perhaps not,
which would show the impact on unmodified implementations).

This is something that works much better in PyPy: the actual string
operations are written in RPython, and the tracing JIT would generate
all versions of the code that are relevant for the different
representations (IIUC, this approach is so far only planned for PyPy).

I hope that C macros can help reduce the maintenance burden.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Gregory P. Smith
BTW, has anyone looked at what other languages with a native unicode
type do for their implementations, and whether any of them attempt to
conserve RAM?


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Antoine Pitrou

  Incidentally, to slightly reduce the overhead of unicode objects,
  there's this proposal: http://bugs.python.org/issue1943
 
 I wonder what aspects of this patch and discussion should be integrated
 into the PEP. The notion of allocating the memory in the same block is
 already considered in the PEP; what else might be relevant?

Ok, I'm sorry for not reading the PEP carefully enough, then.
The patch does a couple of other tweaks such as making state a char
rather than an int, and changing the freelist algorithm. But the latter
doesn't need to be spelled out in a PEP anyway.

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel

Martin v. Löwis, 24.01.2011 21:17:

If the string is created directly with the canonical representation
(see below), this representation doesn't take a separate memory block,
but is allocated right after the PyUnicodeObject struct.


Does this mean it's supposed to become a PyVarObject? Antoine proposed 
that, too. Apart from breaking (more or less) all existing C subtyping 
code, this will also make it harder to subtype it in new code. I don't like 
that idea at all.


Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
Am 27.01.2011 23:53, schrieb Stefan Behnel:
 Martin v. Löwis, 24.01.2011 21:17:
 If the string is created directly with the canonical representation
 (see below), this representation doesn't take a separate memory block,
 but is allocated right after the PyUnicodeObject struct.
 
 Does this mean it's supposed to become a PyVarObject?

What do you mean by "become"? Will it be declared as such? No.

 Antoine proposed
 that, too. Apart from breaking (more or less) all existing C subtyping
 code, this will also make it harder to subtype it in new code. I don't
 like that idea at all.

Why will it break all existing subtyping code? See the PEP: Only objects
created through PyUnicode_New will be affected - I don't think this can
include objects of a subtype.

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel

Martin v. Löwis, 28.01.2011 01:02:

Am 27.01.2011 23:53, schrieb Stefan Behnel:

Martin v. Löwis, 24.01.2011 21:17:

If the string is created directly with the canonical representation
(see below), this representation doesn't take a separate memory block,
but is allocated right after the PyUnicodeObject struct.


Does this mean it's supposed to become a PyVarObject?


What do you mean by "become"? Will it be declared as such? No.


Antoine proposed
that, too. Apart from breaking (more or less) all existing C subtyping
code, this will also make it harder to subtype it in new code. I don't
like that idea at all.


Why will it break all existing subtyping code? See the PEP: Only objects
created through PyUnicode_New will be affected - I don't think this can
include objects of a subtype.


Ok, that's fine then.

Stefan



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Nick Coghlan
On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease digitalx...@gmail.com wrote:
 On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg m...@egenix.com wrote:
 I also don't see how this could save a lot of memory. As an example
 take a French text with say 10mio code points. This would end up
 appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
 one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
 on how many accents are used). That's a saving of -10MB compared to
 today's implementation :-)

 If I am reading the pep right, which I may not be as I am no expert on
 unicode, the new implementation would actually give a 10MB saving
 since the wchar field is optional, so only the str (Latin-1) and utf8
 fields would need to be stored. How it decides not to store one field
 or another would need to be clarified in the PEP if I am right.

The PEP actually does define that already:

PyUnicode_AsUTF8 populates the utf8 field of the existing string,
while PyUnicode_AsUTF8String creates a *new* string with that field
populated.

PyUnicode_AsUnicode will populate the wstr field (but doing so
generally shouldn't be necessary).

For a UCS4 build, my reading of the PEP puts the memory savings for a
100 code point string as follows:

Current size: 400 bytes (regardless of max code point)

New initial size (max code point < 256): 100 bytes (75% saving)
New initial size (max code point < 65536): 200 bytes (50% saving)
New initial size (max code point >= 65536): 400 bytes (no saving)

For each of the new strings, they may consume additional storage if
the utf8 or wstr fields get populated. The maximum possible size would
be a UCS4 string (max code point >= 65536) on a sizeof(wchar_t) == 2
system with the utf8 string populated. In such cases, you would
consume at least 700 bytes (400 for the UCS-4 str, at least 200 for the
UTF-16 wstr, at least 100 for the UTF-8 form), plus whatever additional
memory is needed to encode the non-BMP characters into UTF-8 and UTF-16.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Paul Moore
On 26 January 2011 12:30, Nick Coghlan ncogh...@gmail.com wrote:
 The PEP actually does define that already:

 PyUnicode_AsUTF8 populates the utf8 field of the existing string,
 while PyUnicode_AsUTF8String creates a *new* string with that field
 populated.

 PyUnicode_AsUnicode will populate the wstr field (but doing so
 generally shouldn't be necessary).

AIUI, another point is that the PEP deprecates the use of the calls
that populate the utf8 and wstr fields, in favour of the calls that
expect the caller to manage the extra memory (PyUnicode_AsUTF8String
rather than PyUnicode_AsUTF8, ??? rather than PyUnicode_AsUnicode). So
in the long term, the extra fields should never be populated -
although this could take some time as extensions have to be recoded.
Ultimately, the extra fields and older APIs could even be removed.

So any space cost (which I concede could be non-trivial in some cases)
is expected to be short-term.

Paul.


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Gregory P. Smith
On Mon, Jan 24, 2011 at 3:20 PM, Antoine Pitrou solip...@pitrou.net wrote:
 On Tuesday, 25 January 2011 at 00:07 +0100, Martin v. Löwis wrote:
  I'd like to propose PEP 393, which takes a different approach,
  addressing both problems simultaneously: by getting a flexible
  representation (one that can be either 1, 2, or 4 bytes), we can
  support the full range of Unicode on all systems, but still use
  only one byte per character for strings that are pure ASCII (which
  will be the majority of strings for the majority of users).
 
  For this kind of experiment, I think a concrete attempt at implementing
  (together with performance/memory savings numbers) would be much more
  useful than an abstract proposal.

 I partially agree. An implementation is certainly needed, but there is
 nothing wrong (IMO) with designing the change before implementing it.
 Also, several people have offered to help with the implementation, so
 we need to agree on a specification first (which is actually cheaper
 than starting with the implementation only to find out that people
 misunderstood each other).

 I'm not sure it's really cheaper. When implementing you will probably
 find out that it makes more sense to change the meaning of some fields,
 add or remove some, etc. You will also want to try various tweaks since
 the whole point is to lighten the footprint of unicode strings in common
 workloads.

Yep.  This is only a proposal, an implementation will allow all of
that to be experimented with.

I have frequently seen code today, even in python 2.x, that suffers
greatly from unicode vs string use (due to APIs in some code that were
returning unicode objects unnecessarily when the data was really all
ascii text).  python 3.x only increases this as the default for so
many things passes through unicode even for programs that may not need
it.


 So, the only criticism I have, intuitively, is that the unicode
 structure seems to become a bit too large. For example, I'm not sure you
 need a generic (pointer, size) pair in addition to the
 representation-specific ones.

I believe the intent this pep is aiming at is for the existing in
memory structure to be compatible with already compiled binary
extension modules without having to recompile them or change the APIs
they are using.

Personally I don't care at all about preserving that level of binary
compatibility, it has been convenient in the past but is rarely the
right thing to do.  Of course I'd personally like to see PyObject
nuked and revisited, it is too large and is probably not cache line
efficient.


 Incidentally, to slightly reduce the overhead of unicode objects,
 there's this proposal: http://bugs.python.org/issue1943

Interesting.  But that aims more at cpu performance than memory
overhead.  What I see is programs that predominantly process ascii
data yet waste memory on a 2-4x data explosion of the internal
representation.  This PEP aims to address that larger target.

-gps


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Nick Coghlan
On Tue, Jan 25, 2011 at 6:17 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 A new function PyUnicode_AsUTF8 is provided to access the UTF-8
 representation. It is thus identical to the existing
 _PyUnicode_AsString, which is removed. The function will compute the
 utf8 representation when first called. Since this representation will
 consume memory until the string object is released, applications
 should use the existing PyUnicode_AsUTF8String where possible
 (which generates a new string object every time). API that implicitly
 converts a string to a char* (such as the ParseTuple functions) will
 use this function to compute a conversion.

I'm not entirely clear as to what this function is referring to here.

I'm also dubious of the PyUnicode_Finalize name - PyUnicode_Ready
might be a better option (PyType_Ready seems a better analogy for a
"I've filled everything in, please calculate the derived fields now"
than Py_Finalize).

More generally, let me see if I understand the proposed structure correctly:

str: Always set once PyUnicode_Ready() has been called.
  Always points to the canonical representation of the string (as
indicated by PyUnicode_Kind)
length: Always set once PyUnicode_Ready() has been called. Specifies
the number of code points in the string.

wstr: Set only if PyUnicode_AsUnicode has been called on the string.
If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr
= str, otherwise wstr points to dedicated memory
wstr_length: Valid only if wstr != NULL
If wstr_length != length, indicates presence of surrogate pairs in
a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
PyUnicode_4BYTE).

utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
If string contents are pure ASCII, utf8 = str, otherwise utf8
points to dedicated memory.
utf8_length: Valid only if utf8_ptr != NULL

One change I would propose is that rather than hiding flags in the low
order bits of the str pointer, we expand the use of the existing
state field to cover the representation information in addition to
the interning information. I would also suggest explicitly flagging
internally whether or not a 1 byte string is ASCII or Latin-1 along
the lines of:

/* Already existing string state constants */
#define SSTATE_NOT_INTERNED 0x00
#define SSTATE_INTERNED_MORTAL 0x01
#define SSTATE_INTERNED_IMMORTAL 0x02
/* New string state constants */
#define SSTATE_INTERN_MASK 0x03
#define SSTATE_KIND_ASCII 0x00
#define SSTATE_KIND_LATIN1 0x04
#define SSTATE_KIND_2BYTE 0x08
#define SSTATE_KIND_4BYTE 0x0C
#define SSTATE_KIND_MASK 0x0C


PyUnicode_Kind would then return PyUnicode_1BYTE for strings that were
flagged internally as either ASCII or LATIN1.
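
As a quick sanity check of that mapping, a self-contained sketch (the
KIND_* values and the function are placeholders of mine, not proposed
API):

    #include <stdio.h>

    #define SSTATE_KIND_ASCII  0x00
    #define SSTATE_KIND_LATIN1 0x04
    #define SSTATE_KIND_2BYTE  0x08
    #define SSTATE_KIND_4BYTE  0x0C
    #define SSTATE_KIND_MASK   0x0C

    enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

    /* ASCII and Latin-1 both report as the 1-byte kind; the interning
       bits in the low two positions are masked out. */
    static int kind_from_state(int state)
    {
        switch (state & SSTATE_KIND_MASK) {
        case SSTATE_KIND_ASCII:
        case SSTATE_KIND_LATIN1:
            return KIND_1BYTE;
        case SSTATE_KIND_2BYTE:
            return KIND_2BYTE;
        default:
            return KIND_4BYTE;
        }
    }

    int main(void)
    {
        int state = 0x01 | SSTATE_KIND_LATIN1;  /* interned-mortal Latin-1 */
        printf("%d byte(s) per code point\n", kind_from_state(state));
        return 0;
    }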

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread M.-A. Lemburg
I'll comment more on this later this week...

From my first impression, I'm
not too thrilled by the prospect of making the Unicode implementation
more complicated by having three different representations on each
object.

I also don't see how this could save a lot of memory. As an example
take a French text with say 10mio code points. This would end up
appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
on how many accents are used). That's a saving of -10MB compared to
today's implementation :-)

Martin v. Löwis wrote:
 I have been thinking about Unicode representation for some time now.
 This was triggered, on the one hand, by discussions with Glyph Lefkowitz
 (who complained that his server app consumes too much memory), and Carl
 Friedrich Bolz (who profiled Python applications to determine that
 Unicode strings are among the top consumers of memory in Python).
 On the other hand, this was triggered by the discussion on supporting
 surrogates in the library better.
 
 I'd like to propose PEP 393, which takes a different approach,
 addressing both problems simultaneously: by getting a flexible
 representation (one that can be either 1, 2, or 4 bytes), we can
 support the full range of Unicode on all systems, but still use
 only one byte per character for strings that are pure ASCII (which
 will be the majority of strings for the majority of users).
 
 You'll find the PEP at
 
 http://www.python.org/dev/peps/pep-0393/
 
 For convenience, I include it below.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 25 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Antoine Pitrou

For the record:

 I also don't see how this could save a lot of memory. As an example
 take a French text with say 10mio code points. This would end up
 appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
 one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
 on how many accents are used).

Typical French text seems to have 5% non-ASCII characters. So the
number of UTF-8 bytes needed to represent a French text would only be
5% higher than the number of code points.
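
To put rough numbers on that (a back-of-the-envelope estimate, not a
measurement): 10 million code points with 95% plain ASCII (1 UTF-8 byte
each) and 5% accented Latin-1 (2 UTF-8 bytes each) comes to about
9.5 MB + 1 MB, i.e. ~10.5 MB of UTF-8, against 20 MB for a UCS-2
buffer.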

Anyway, it's quite obvious that Martin's goal is that only one
representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.”

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Antoine Pitrou
On Tue, 25 Jan 2011 21:08:01 +1000
Nick Coghlan ncogh...@gmail.com wrote:
 
 One change I would propose is that rather than hiding flags in the low
 order bits of the str pointer, we expand the use of the existing
 state field to cover the representation information in addition to
 the interning information.

+1, by the way. The state field has many bits available (even if we
decide to make it a char rather than an int).

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Dj Gilcrease
On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg m...@egenix.com wrote:
 I also don't see how this could save a lot of memory. As an example
 take a French text with say 10mio code points. This would end up
 appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
 one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
 on how many accents are used). That's a saving of -10MB compared to
 today's implementation :-)

If I am reading the PEP right, which I may not be as I am no expert on
Unicode, the new implementation would actually give a 10MB saving,
since the wchar field is optional, so only the str (Latin-1) and utf8
fields would need to be stored. How it decides not to store one field
or another would need to be clarified in the PEP if I am right.


[Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Martin v. Löwis
I have been thinking about Unicode representation for some time now.
This was triggered, on the one hand, by discussions with Glyph Lefkowitz
(who complained that his server app consumes too much memory), and Carl
Friedrich Bolz (who profiled Python applications to determine that
Unicode strings are among the top consumers of memory in Python).
On the other hand, this was triggered by the discussion on supporting
surrogates in the library better.

I'd like to propose PEP 393, which takes a different approach,
addressing both problems simultaneously: by getting a flexible
representation (one that can be either 1, 2, or 4 bytes), we can
support the full range of Unicode on all systems, but still use
only one byte per character for strings that are pure ASCII (which
will be the majority of strings for the majority of users).

You'll find the PEP at

http://www.python.org/dev/peps/pep-0393/

For convenience, I include it below.

Regards,
Martin

PEP: 393
Title: Flexible String Representation
Version: $Revision: 88168 $
Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $
Author: Martin v. Löwis mar...@v.loewis.de
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract


The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out.

Rationale
=

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced.

Specification
=

The Unicode object structure is changed to this definition::

  typedef struct {
PyObject_HEAD
Py_ssize_t length;
void *str;
Py_hash_t hash;
int state;
Py_ssize_t utf8_length;
void *utf8;
Py_ssize_t wstr_length;
void *wstr;
  } PyUnicodeObject;

These fields have the following interpretations:

- length: number of code points in the string (result of sq_length)
- str: shortest-form representation of the unicode string; the lower
  two bits of the pointer indicate the specific form:
  01 = 1 byte (Latin-1); 11 = 2 byte (UCS-2); 11 = 4 byte (UCS-4);
  00 = null pointer

  The string is null-terminated (in its respective representation).
- hash, state: same as in Python 3.2
- utf8_length, utf8: UTF-8 representation (null-terminated)
- wstr_length, wstr: representation in platform's wchar_t
  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
  pairs (in which case wstr_length differs from length).
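
An illustrative decoding of those tag bits (a sketch of mine, not part
of the PEP text; it assumes pointers are at least 4-byte aligned and
reads the second "11" above as "10" for the 2-byte form):

    #include <stdint.h>

    #define STR_TAG_NULL   0x0   /* 00 = null pointer */
    #define STR_TAG_1BYTE  0x1   /* 01 = Latin-1 */
    #define STR_TAG_2BYTE  0x2   /* 10 = UCS-2 */
    #define STR_TAG_4BYTE  0x3   /* 11 = UCS-4 */

    /* The representation kind lives in the low two bits of str. */
    static unsigned str_kind(const void *str)
    {
        return (unsigned)((uintptr_t)str & 0x3);
    }

    /* Mask the tag off to recover the actual buffer address. */
    static void *str_buffer(const void *str)
    {
        return (void *)((uintptr_t)str & ~(uintptr_t)0x3);
    }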

All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.

The str and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient). The str
and wstr pointers point to the same memory if the string happens to
fit exactly to the wchar_t type of the platform (i.e. uses some
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
non-BMP characters if sizeof(wchar_t) is 4).
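
A small illustration of why pure Latin-1 is not enough for the str/utf8
sharing above (my sketch, not PEP text): the buffers can be shared only
when the byte sequences are identical, which holds exactly for ASCII.

    #include <stdbool.h>
    #include <stddef.h>

    /* A 1-byte (Latin-1) buffer can double as the UTF-8 buffer only if
       every byte is < 0x80; code points 0x80..0xFF take two bytes in
       UTF-8, so their Latin-1 and UTF-8 encodings differ. */
    static bool can_share_str_and_utf8(const unsigned char *latin1,
                                       size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (latin1[i] >= 0x80)
                return false;
        return true;
    }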

If the string is created directly with the canonical representation
(see below), this representation doesn't take a separate memory block,
but is allocated right after the PyUnicodeObject struct.

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Antoine Pitrou
On Mon, 24 Jan 2011 21:17:34 +0100
Martin v. Löwis mar...@v.loewis.de wrote:
 I have been thinking about Unicode representation for some time now.
 This was triggered, on the one hand, by discussions with Glyph Lefkowitz
 (who complained that his server app consumes too much memory), and Carl
 Friedrich Bolz (who profiled Python applications to determine that
 Unicode strings are among the top consumers of memory in Python).
 On the other hand, this was triggered by the discussion on supporting
 surrogates in the library better.
 
 I'd like to propose PEP 393, which takes a different approach,
 addressing both problems simultaneously: by getting a flexible
 representation (one that can be either 1, 2, or 4 bytes), we can
 support the full range of Unicode on all systems, but still use
 only one byte per character for strings that are pure ASCII (which
 will be the majority of strings for the majority of users).

For this kind of experiment, I think a concrete attempt at implementing
(together with performance/memory savings numbers) would be much more
useful than an abstract proposal. It is hard to judge the concrete
effects of the changes you are proposing, even though they might (or
not) make sense in theory. For example, you are adding a lot of
constant overhead to every unicode object, even very small ones, which
might be detrimental. Also, accessing the unicode object's payload
can become quite a bit more cumbersome. Only implementing it can tell
how workable this is in practice.
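
For a concrete feel of that constant overhead, a stand-alone
approximation of the proposed layout (PyObject_HEAD is modelled here as
a refcount plus a type pointer, and Py_ssize_t/Py_hash_t as ptrdiff_t;
the exact numbers will differ):

    #include <stdio.h>
    #include <stddef.h>

    /* Rough stand-in for the proposed PyUnicodeObject layout. */
    typedef struct {
        ptrdiff_t ob_refcnt;     /* PyObject_HEAD, approximated */
        void     *ob_type;
        ptrdiff_t length;
        void     *str;
        ptrdiff_t hash;
        int       state;
        ptrdiff_t utf8_length;
        void     *utf8;
        ptrdiff_t wstr_length;
        void     *wstr;
    } approx_unicode;

    int main(void)
    {
        /* Around 80 bytes on a typical 64-bit build, before any
           character data is stored. */
        printf("per-object overhead: %zu bytes\n",
               sizeof(approx_unicode));
        return 0;
    }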

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Martin v. Löwis
 I'd like to propose PEP 393, which takes a different approach,
 addressing both problems simultaneously: by getting a flexible
 representation (one that can be either 1, 2, or 4 bytes), we can
 support the full range of Unicode on all systems, but still use
 only one byte per character for strings that are pure ASCII (which
 will be the majority of strings for the majority of users).
 
 For this kind of experiment, I think a concrete attempt at implementing
 (together with performance/memory savings numbers) would be much more
 useful than an abstract proposal.

I partially agree. An implementation is certainly needed, but there is
nothing wrong (IMO) with designing the change before implementing it.
Also, several people have offered to help with the implementation, so
we need to agree on a specification first (which is actually cheaper
than starting with the implementation only to find out that people
misunderstood each other).

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Antoine Pitrou
Le mardi 25 janvier 2011 à 00:07 +0100, Martin v. Löwis a écrit :
  I'd like to propose PEP 393, which takes a different approach,
  addressing both problems simultaneously: by getting a flexible
  representation (one that can be either 1, 2, or 4 bytes), we can
  support the full range of Unicode on all systems, but still use
  only one byte per character for strings that are pure ASCII (which
  will be the majority of strings for the majority of users).
  
  For this kind of experiment, I think a concrete attempt at implementing
  (together with performance/memory savings numbers) would be much more
  useful than an abstract proposal.
 
 I partially agree. An implementation is certainly needed, but there is
 nothing wrong (IMO) with designing the change before implementing it.
 Also, several people have offered to help with the implementation, so
 we need to agree on a specification first (which is actually cheaper
 than starting with the implementation only to find out that people
 misunderstood each other).

I'm not sure it's really cheaper. When implementing you will probably
find out that it makes more sense to change the meaning of some fields,
add or remove some, etc. You will also want to try various tweaks since
the whole point is to lighten the footprint of unicode strings in common
workloads.

So, the only criticism I have, intuitively, is that the unicode
structure seems to become a bit too large. For example, I'm not sure you
need a generic (pointer, size) pair in addition to the
representation-specific ones.

Incidentally, to slightly reduce the overhead of unicode objects,
there's this proposal: http://bugs.python.org/issue1943

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread David Malcolm
On Mon, 2011-01-24 at 21:17 +0100, Martin v. Löwis wrote:

... snip ...

 I'd like to propose PEP 393, which takes a different approach,
 addressing both problems simultaneously: by getting a flexible
 representation (one that can be either 1, 2, or 4 bytes), we can
 support the full range of Unicode on all systems, but still use
 only one byte per character for strings that are pure ASCII (which
 will be the majority of strings for the majority of users).

There was some discussion about this at PyCon 2010, where we referred to
it casually as "Pay-as-you-go unicode".

... snip ...

 - str: shortest-form representation of the unicode string; the lower
   two bits of the pointer indicate the specific form:
   01 = 1 byte (Latin-1); 11 = 2 byte (UCS-2); 11 = 4 byte (UCS-4);
Repetition of 11; I'm guessing that the 2byte/UCS-2 should read 10,
so that they give the width of the char representation.

   00 = null pointer

Naturally this assumes that all pointers are at least 4-byte aligned (so
that they can be masked off).  I assume that this is sane on every
platform that Python supports, but should it be spelled out explicitly
somewhere in the PEP?

 
   The string is null-terminated (in its respective representation).
 - hash, state: same as in Python 3.2
 - utf8_length, utf8: UTF-8 representation (null-terminated)
If this is to share its buffer with the str representation for the
Latin-1 case, then I take it this ptr will typically be (str & ~4)?
i.e. only str has the low-order-bit type info.

 - wstr_length, wstr: representation in platform's wchar_t
   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
   pairs (in which case wstr_length differs from length).
 
 All three representations are optional, although the str form is
 considered the canonical representation which can be absent only
 while the string is being created.

Spelling out the meaning of "optional":
  does this mean that the relevant ptr is NULL; if so, if utf8 is null,
is utf8_length undefined, or is it some dummy value?  (i.e. is the
pointer the first thing to check before we know if utf8_length is
meaningful?); similar consideration for the wstr representation.


 The Py_UNICODE type is still supported but deprecated. It is always
 defined as a typedef for wchar_t, so the wstr representation can double
 as Py_UNICODE representation.
 
 The str and utf8 pointers point to the same memory if the string uses
 only ASCII characters (using only Latin-1 is not sufficient). The str
...though the ptrs are non-equal for this case, as noted above, as str
has an 0x1 typecode.

 and wstr pointers point to the same memory if the string happens to
 fit exactly to the wchar_t type of the platform (i.e. uses some
 BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
 non-BMP characters if sizeof(wchar_t) is 4).
 
 If the string is created directly with the canonical representation
 (see below), this representation doesn't take a separate memory block,
 but is allocated right after the PyUnicodeObject struct.

Is the idea to do pointer arithmetic when deleting the PyUnicodeObject
to determine if the ptr is in that location, and not delete it if it is,
or is there some other way of determining whether the pointers need
deallocating?  If the former, is this embedding an assumption that the
underlying allocator couldn't have allocated a buffer directly adjacent
to the PyUnicodeObject?  I know that GNU libc's malloc/free
implementation has gaps of two machine words between each allocation;
off the top of my head I'm not sure if the optimized Object/obmalloc.c
allocator enforces such gaps.
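
For reference, the kind of check the question describes might look like
this toy sketch (mine, not anything from the PEP or CPython; a flag in
the state field would sidestep the adjacency concern entirely):

    #include <stdlib.h>

    /* Toy object: the character data lives either in a separately
       allocated buffer or immediately after the struct itself. */
    typedef struct {
        long  length;
        char *data;
    } toy_str;

    /* Pointer-arithmetic test: the buffer is "inline" iff it starts
       right after the struct, so only out-of-line buffers are freed.
       This is exactly where the adjacent-allocation worry bites. */
    static void toy_str_dealloc(toy_str *s)
    {
        char *inline_start = (char *)s + sizeof(*s);
        if (s->data != NULL && s->data != inline_start)
            free(s->data);
        free(s);
    }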

... snip ...

Extra section:

GDB Debugging Hooks
---
Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, including PyUnicodeObject
instances.  It will need to be slightly updated to track the change.

(I can do that change if need be; it shouldn't be too hard).



Hope this is helpful
Dave
