
Re: [Python-Dev] New Py_UNICODE doc

2005-05-10 Thread Nicholas Bastin


On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:

>> Wow, what an inane way of looking at it.  I don't know what world you
>> live in, but in my world, users read the configure options and suppose
>> that they mean something.  In fact, they *have* to go off on their own
>> to assume something, because even the documentation you refer to above
>> doesn't say what happens if they choose UCS-2 or UCS-4.  A logical
>> assumption would be that python would use those CEFs internally, and
>> that would be incorrect.
>
> Certainly. That's why the documentation should be improved. Changing
> the option breaks existing packaging systems, and should not be done
> lightly.

I'm perfectly happy to continue supporting --enable-unicode=ucs2, but 
not displaying it as an option.  Is that acceptable to you?

--
Nick

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] New Py_UNICODE doc

2005-05-10 Thread Martin v. Löwis

M.-A. Lemburg wrote:
 If all you're interested in is the lexical class of the code points
 in a string, you could use such a codec to map each code point
 to a code point representing the lexical class.

How can I efficiently implement such a codec? The whole point is doing
that in pure Python (because if I had to write an extension module,
I could just as well do the entire lexical analysis in C, without
any regular expressions).

Any kind of associative/indexed table for this task consumes a lot
of memory, and takes quite some time to initialize.
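
A minimal sketch of the mapping idea under discussion, written for current Python 3 rather than the 2005 interpreter. It relies on str.translate accepting any mapping from code points to replacement strings, and on a dict subclass filling itself in lazily so that only code points actually seen cost memory. The four class letters and the names LexicalClassTable, TABLE and lexical_classes are illustrative assumptions, not XML's real lexical classes.

```python
import unicodedata


class LexicalClassTable(dict):
    """Maps a code point (int) to a one-letter class, computed on first use."""

    def __missing__(self, codepoint):
        char = chr(codepoint)
        category = unicodedata.category(char)
        if category.startswith("L"):
            label = "L"  # letter
        elif category == "Nd":
            label = "D"  # decimal digit
        elif category.startswith("Z") or char in "\t\n\r":
            label = "S"  # whitespace
        else:
            label = "O"  # anything else
        self[codepoint] = label  # cache so the next lookup is a plain dict hit
        return label


TABLE = LexicalClassTable()


def lexical_classes(text):
    """Replace every character of `text` by the letter of its class."""
    return text.translate(TABLE)
```

A regular expression can then be run over the returned class string instead of over the original text.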

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-10 Thread James Y Knight


On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote:
> On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
>
>>> Wow, what an inane way of looking at it.  I don't know what world you
>>> live in, but in my world, users read the configure options and suppose
>>> that they mean something.  In fact, they *have* to go off on their own
>>> to assume something, because even the documentation you refer to above
>>> doesn't say what happens if they choose UCS-2 or UCS-4.  A logical
>>> assumption would be that python would use those CEFs internally, and
>>> that would be incorrect.
>>
>> Certainly. That's why the documentation should be improved. Changing
>> the option breaks existing packaging systems, and should not be done
>> lightly.
>
> I'm perfectly happy to continue supporting --enable-unicode=ucs2,
> but not displaying it as an option.  Is that acceptable to you?


If you're going to call python's implementation UTF-16, I'd consider
all these very serious deficiencies:
- unicodedata doesn't work for 2-char strings containing a surrogate
pair, nor for integers. Therefore it is impossible to get any data on
chars > 0xFFFF.
- there are no methods for determining if something is a surrogate
pair and turning it into an integer codepoint.
- Given that unicodedata doesn't work, I also doubt that .toupper/etc
work right on surrogate pairs, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat
surrogate pairs as a single unit.
- Is there a method that is like unichr but that will work for
codepoints > 0xFFFF?

I'm sure there's more as well. I think it's a mistake to consider  
python to be implementing UTF-16 just because it properly encodes/ 
decodes surrogate pairs in the UTF-8 codec.
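
The helpers asked for in the list above could look roughly like the following sketch. It is written for current Python 3 and is not part of the interpreter discussed in this thread. The function names are hypothetical, and code units are passed as plain integers so that the example does not depend on how the interpreter stores strings.

```python
import unicodedata


def is_surrogate_pair(units):
    """True if `units` is a high surrogate followed by a low surrogate."""
    return (
        len(units) == 2
        and 0xD800 <= units[0] <= 0xDBFF
        and 0xDC00 <= units[1] <= 0xDFFF
    )


def pair_to_codepoint(high, low):
    """Combine a surrogate pair into the integer code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def name_of_pair(high, low):
    """Look up the Unicode character name for a surrogate pair."""
    return unicodedata.name(chr(pair_to_codepoint(high, low)))
```

For example, the pair 0xD834 0xDD1E combines to U+1D11E, for which name_of_pair returns "MUSICAL SYMBOL G CLEF".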

James

Re: [Python-Dev] New Py_UNICODE doc

2005-05-09 Thread Martin v. Löwis

M.-A. Lemburg wrote:
 On sre character classes: I don't think that these provide
 a good approach to XML lexical classes - custom functions
 or methods or maybe even a codec mapping the characters
 to their XML lexical class are much more efficient in
 practice.

That isn't my experience: functions that scan XML strings
are much slower than regular expressions. I can't imagine
how a custom codec could work, so I cannot comment on that.

Regards,
Martin


Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Martin v. Löwis

M.-A. Lemburg wrote:
 I believe that it would be more appropriate to adjust the _tkinter
 module to adapt to the TCL Unicode size rather than
 forcing the complete Python system to adapt to TCL - I don't
 really see the point in an optional extension module
 defining the default for the interpreter core.

_tkinter currently supports, for a UCS-2 Tcl, both UCS-2 and UCS-4
Python. For a UCS-4 Tcl, it requires Python also to be UCS-4.
Contributions to support the missing case are welcome.

 At the very least, this should be a user controlled option.

It is: by passing --enable-unicode=ucs2, you can force Python
to use UCS-2 even if Tcl is UCS-4, with the result that
_tkinter cannot be built anymore (and compilation fails
with an #error).

 Otherwise, we might as well use sizeof(wchar_t) as basis
 for the default Unicode size. This at least, would be
 a much more reasonable choice than whatever TCL uses.

The goal of the build process is to provide as many extension
modules as possible (given the set of headers and libraries
installed), and _tkinter is an important extension module
because IDLE depends on it.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Martin v. Löwis

Nicholas Bastin wrote:
>> -1. This breaks existing documentation and usage, and provides only
>> minimum value.
>
> Have you been missing this conversation?  UTF-16 is *WHAT PYTHON
> CURRENTLY IMPLEMENTS*.  The current documentation is flat out wrong.
> Breaking that isn't a big problem in my book.

The documentation I refer to is the one that says the equivalent of

'configure takes an option --enable-unicode, with the possible
values "ucs2", "ucs4", "yes" (equivalent to no argument),
and "no" (equivalent to --disable-unicode)'

*THIS* documentation would break. This documentation is factually
correct at the moment (configure does indeed take these options),
and people rely on them in automatic build processes. Changing
configure options should not be taken lightly, even if they
may result from a wrong mental model. By that rule, --with-suffix
should be renamed to --enable-suffix, --with-doc-strings to
--enable-doc-strings, and so on. However, the nitpicking that
underlies the desire to rename the option should be ignored
in favour of backwards compatibility.

Changing the documentation that goes along with the option
would be fine.

 It provides more than minimum value - it provides the truth.

No. It is just a command line option. It could be named
--enable-quirk=(quork|quark), and would still select UTF-16.
Command line options provide no truth - they don't even
provide statements.

>> With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start
>> supporting the full Unicode ccs the same way it supports UCS-2.
>
> I can't understand what you mean by this.  My point is that if you
> configure python to support UCS-2, then it SHOULD NOT support surrogate
> pairs.  Supporting surrogate pairs is the purview of variable width
> encodings, and UCS-2 is not among them.

So you suggest renaming it to --enable-unicode=utf16, right?
My point is that a Unicode type with UTF-16 would correctly
support all assigned Unicode code points, which the current
2-byte implementation doesn't. So --enable-unicode=utf16 would
*not* be the truth.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Martin v. Löwis

Nicholas Bastin wrote:
 All of my proposals for what to change the documentation to have been
 shot down by Martin.  If someone has better verbiage that they'd like
 to see, I'd be perfectly happy to patch the doc.

I am not concerned with the specific wording - you speak English much
better than I do. What I care about is that this part of the documentation
should be complete and precise. I.e. statements like "should not make
assumptions" might be fine, as long as they are still followed by
a precise description of what the code currently does. So it should
mention that the representation can be either 2 or 4 bytes, that
the strings "ucs2" and "ucs4" can be used to select one of them,
that it is always 2 bytes on Windows, that 2 bytes means that non-BMP
characters can be represented as surrogate pairs, and so on.
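
From Python code, the two representations described above can be told apart through sys.maxunicode, which is 0xFFFF on a 2-byte build and 0x10FFFF on a 4-byte build. The sketch below uses a hypothetical helper name. On Python 3.3 and later the compile-time choice no longer exists and sys.maxunicode is always 0x10FFFF, so only the second branch is reached there unless a value is passed explicitly.

```python
import sys


def describe_unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter by the largest code point one string item holds."""
    if maxunicode == 0xFFFF:
        return "narrow: 2-byte units, non-BMP characters appear as surrogate pairs"
    return "wide: every code point up to U+10FFFF is a single item"
```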

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Shane Hathaway

M.-A. Lemburg wrote:
 All this talk about UTF-16 vs. UCS-2 is not very useful
 and strikes me as purely academic.
 
 The reference to possible breakage by slicing a Unicode string and
 breaking a surrogate pair is valid, the idea of UCS-4 being
 less prone to breakage is a myth:

Fair enough.  The original point is that the documentation is unclear
about what a Py_UNICODE[] contains.  I deduced that it contains either
UCS2 or UCS4 and implemented accordingly.  Not only did I guess wrong,
but others will probably guess wrong too.  Something in the docs needs
to spell this out.

Shane

Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Martin v. Löwis

Shane Hathaway wrote:
 Fair enough.  The original point is that the documentation is unclear
 about what a Py_UNICODE[] contains.  I deduced that it contains either
 UCS2 or UCS4 and implemented accordingly.  Not only did I guess wrong,
 but others will probably guess wrong too.  Something in the docs needs
 to spell this out.

Again, patches are welcome. I was opposed to Nick's proposed changes,
since they explicitly said that you are not supposed to know what
is in a Py_UNICODE. Integrating the essence of PEP 261 into the
main documentation would be a worthwhile task.

Regards,
Martin


Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Nicholas Bastin


On May 8, 2005, at 5:15 AM, Martin v. Löwis wrote:

 'configure takes an option --enable-unicode, with the possible
 values "ucs2", "ucs4", "yes" (equivalent to no argument),
 and "no" (equivalent to --disable-unicode)'

 *THIS* documentation would break. This documentation is factually
 correct at the moment (configure does indeed take these options),
 and people rely on them in automatic build processes. Changing
 configure options should not be taken lightly, even if they
 may result from a wrong mental model. By that rule, --with-suffix
 should be renamed to --enable-suffix, --with-doc-strings to
 --enable-doc-strings, and so on. However, the nitpicking that
 underlies the desire to rename the option should be ignored
 in favour of backwards compatibility.

 Changing the documentation that goes along with the option
 would be fine.

That is exactly what I proposed originally, which you shot down.  
Please actually read the contents of my messages.  What I said was 
change the configure option and related documentation.


>> It provides more than minimum value - it provides the truth.
>
> No. It is just a command line option. It could be named
> --enable-quirk=(quork|quark), and would still select UTF-16.
> Command line options provide no truth - they don't even
> provide statements.

Wow, what an inane way of looking at it.  I don't know what world you 
live in, but in my world, users read the configure options and suppose 
that they mean something.  In fact, they *have* to go off on their own 
to assume something, because even the documentation you refer to above 
doesn't say what happens if they choose UCS-2 or UCS-4.  A logical 
assumption would be that python would use those CEFs internally, and 
that would be incorrect.

>>> With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start
>>> supporting the full Unicode ccs the same way it supports UCS-2.
>>
>> I can't understand what you mean by this.  My point is that if you
>> configure python to support UCS-2, then it SHOULD NOT support surrogate
>> pairs.  Supporting surrogate pairs is the purview of variable width
>> encodings, and UCS-2 is not among them.
>
> So you suggest to renaming it to --enable-unicode=utf16, right?
> My point is that a Unicode type with UTF-16 would correctly
> support all assigned Unicode code points, which the current
> 2-byte implementation doesn't. So --enable-unicode=utf16 would
> *not* be the truth.

The current implementation supports the UTF-16 CEF.  i.e., it supports 
a variable width encoding form capable of representing all of the 
unicode space using surrogate pairs.  Please point out a code point 
that the current 2 byte implementation does not support, either 
directly, or through the use of surrogate pairs.
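
As a sketch of what representing "all of the unicode space using surrogate pairs" means arithmetically, the function below splits any code point into its 16-bit UTF-16 code units. The name to_utf16_units is hypothetical. Code points in the surrogate range itself are passed through unchanged, which is a simplifying assumption of this example.

```python
def to_utf16_units(codepoint):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (codepoint,))
    if codepoint < 0x10000:
        return [codepoint]  # BMP: one unit, stored directly
    offset = codepoint - 0x10000  # 20 bits, split 10/10 across the pair
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

The result can be checked against the standard codec: to_utf16_units(0x12345) gives [0xD808, 0xDF45], the same two units that chr(0x12345).encode("utf-16-be") produces in Python 3.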

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-08 Thread Nicholas Bastin


On May 8, 2005, at 1:44 PM, Martin v. Löwis wrote:

> Shane Hathaway wrote:
>> Fair enough.  The original point is that the documentation is unclear
>> about what a Py_UNICODE[] contains.  I deduced that it contains either
>> UCS2 or UCS4 and implemented accordingly.  Not only did I guess wrong,
>> but others will probably guess wrong too.  Something in the docs needs
>> to spell this out.
>
> Again, patches are welcome. I was opposed to Nick's proposed changes,
> since they explicitly said that you are not supposed to know what
> is in a Py_UNICODE. Integrating the essence of PEP 261 into the
> main documentation would be a worthwhile task.

You can't possibly assume you know specifically what's in a Py_UNICODE 
in any given python installation.  If someone thinks this statement is 
untrue, please explain why.

I realize you might not *want* that to be true, but it is.  Users are 
free to configure their python however they desire, and if that means 
--enable-unicode=ucs2 on RH9, then that is perfectly valid.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread Shane Hathaway

Martin v. Löwis wrote:
> Define "correctly". Python, in ucs2 mode, will allow to address individual
> surrogate codes, e.g. in indexing. So you get
>
> >>> u"\U00012345"[0]

When Python encodes characters internally in UCS-2, I would expect
u"\U00012345" to produce a UnicodeError("character can not be encoded in
UCS-2").

> u'\ud808'
>
> This will never work correctly, and never should, because an efficient
> implementation isn't possible. If you want safe indexing and slicing,
> you need ucs4.
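
The indexing behaviour quoted above can be reproduced on current Python 3, where strings are no longer stored this way, by looking at the UTF-16 code units explicitly. This is a sketch only, and utf16_units is a hypothetical helper.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_units("a\U00012345")
first_of_pair = units[1]   # 0xD808: only half of the character
broken_slice = units[:2]   # the slice ends between the two surrogates
```

Indexing by code unit yields the lone high surrogate 0xD808, and a slice that stops after it cuts the pair in half.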

I agree that UCS4 is needed.  There is a balancing act here; UTF-16 is
widely used and takes less space, while UCS4 is easier to treat as an
array of characters.  Maybe we can have both: unicode objects start with
an internal representation in UTF-16, but get promoted automatically to
UCS4 when you index or slice them.  The difference will not be visible
to Python code.  A compile-time switch will not be necessary.  What do
you think?

Shane

Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread Shane Hathaway

Martin v. Löwis wrote:
> Shane Hathaway wrote:
>
>> I agree that UCS4 is needed.  There is a balancing act here; UTF-16 is
>> widely used and takes less space, while UCS4 is easier to treat as an
>> array of characters.  Maybe we can have both: unicode objects start with
>> an internal representation in UTF-16, but get promoted automatically to
>> UCS4 when you index or slice them.  The difference will not be visible
>> to Python code.  A compile-time switch will not be necessary.  What do
>> you think?
>
> This breaks backwards compatibility with existing extension modules.
> Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and
> can use that to directly access the characters.

Py_UNICODE would always be 32 bits wide.  PyUnicode_AsUnicode would
cause the unicode object to be promoted automatically.  Extensions that
break as a result are technically broken already, aren't they?  They're
not supposed to depend on the size of Py_UNICODE.

Shane

Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread Martin v. Löwis

Shane Hathaway wrote:
 Py_UNICODE would always be 32 bits wide.

This would break PythonWin, which relies on Py_UNICODE being
the same as WCHAR_T. PythonWin is not broken, it just hasn't
been ported to UCS-4, yet (and porting this is difficult and
will cause a performance loss).

Regards,
Martin



Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread M.-A. Lemburg

Shane Hathaway wrote:
> Martin v. Löwis wrote:
>
>> Shane Hathaway wrote:
>>
>>> I agree that UCS4 is needed.  There is a balancing act here; UTF-16 is
>>> widely used and takes less space, while UCS4 is easier to treat as an
>>> array of characters.  Maybe we can have both: unicode objects start with
>>> an internal representation in UTF-16, but get promoted automatically to
>>> UCS4 when you index or slice them.  The difference will not be visible
>>> to Python code.  A compile-time switch will not be necessary.  What do
>>> you think?
>>
>> This breaks backwards compatibility with existing extension modules.
>> Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and
>> can use that to directly access the characters.
>
> Py_UNICODE would always be 32 bits wide.  PyUnicode_AsUnicode would
> cause the unicode object to be promoted automatically.  Extensions that
> break as a result are technically broken already, aren't they?  They're
> not supposed to depend on the size of Py_UNICODE.

-1.

You are free to compile Python with --enable-unicode=ucs4
if you prefer this setting.

I don't see any reason why we should force users to invest 4 bytes
of storage for each Unicode code point - 2 bytes work just fine
and can represent all Unicode characters that are currently
defined (using surrogates if necessary). As more and more
Unicode objects are used in a process, choosing UCS2 vs. UCS4
does make a huge difference in terms of used memory.

All this talk about UTF-16 vs. UCS-2 is not very useful
and strikes me as purely academic.

The reference to possible breakage by slicing a Unicode string and
breaking a surrogate pair is valid, the idea of UCS-4 being
less prone to breakage is a myth:

Unicode has many code points that are meant only for composition
and don't have any standalone meaning, e.g. a combining acute
accent (U+0301), yet they are perfectly valid code points -
regardless of UCS-2 or UCS-4. It is easily possible to break
such a combining sequence using slicing, so the most
often presented argument for using UCS-4 instead of UCS-2
(+ surrogates) is rather weak if seen by daylight.
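
The point about combining sequences can be illustrated with a short sketch for current Python 3. The helper name canonically_equal is an assumption. The normalization itself is done by the standard unicodedata module.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (U+0301)
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Slicing between the base letter and the accent silently drops the accent,
# even though every item in the string is a complete code point.
sliced = decomposed[:1]


def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here decomposed and composed differ as code point sequences, yet canonically_equal treats them as the same text.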

Some may now say that combining sequences are not used
all that often. However, they play a central role in Unicode
normalization (http://www.unicode.org/reports/tr15/),
which is needed whenever you want to semantically
compare Unicode objects and are

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 07 2005)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 

Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread M.-A. Lemburg

Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>> Hmm, looking at the configure.in script, it seems you're right.
>> I wonder why this weird dependency on TCL was added.
>
> If Python is configured for UCS-2, and Tcl for UCS-4, then
> Tkinter would not work out of the box. Hence the weird dependency.

I believe that it would be more appropriate to adjust the _tkinter
module to adapt to the TCL Unicode size rather than
forcing the complete Python system to adapt to TCL - I don't
really see the point in an optional extension module
defining the default for the interpreter core.

At the very least, this should be a user controlled option.

Otherwise, we might as well use sizeof(wchar_t) as basis
for the default Unicode size. This at least, would be
a much more reasonable choice than whatever TCL uses.
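
For reference, the size of the platform's wchar_t can be inspected from Python through ctypes. This is a sketch with a hypothetical function name, and it says nothing about what configure actually does.

```python
import ctypes


def wchar_t_size():
    """Size in bytes of the platform's C wchar_t, as seen through ctypes."""
    return ctypes.sizeof(ctypes.c_wchar)
```

The value is typically 2 on Windows and 4 on Linux and macOS.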

-
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread Nicholas Bastin


On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote:

> Nicholas Bastin wrote:
>> --enable-unicode=ucs2
>>
>> be replaced with:
>>
>> --enable-unicode=utf16
>>
>> and the docs be updated to reflect more accurately the variance of the
>> internal storage type.
>
> -1. This breaks existing documentation and usage, and provides only
> minimum value.

Have you been missing this conversation?  UTF-16 is *WHAT PYTHON 
CURRENTLY IMPLEMENTS*.  The current documentation is flat out wrong.  
Breaking that isn't a big problem in my book.

It provides more than minimum value - it provides the truth.


 With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start
 supporting the full Unicode ccs the same way it supports UCS-2.
 Individual surrogate values remain accessible, and supporting
 non-BMP characters is left to the application (with the exception
 of the UTF-8 codec).

I can't understand what you mean by this.  My point is that if you 
configure python to support UCS-2, then it SHOULD NOT support surrogate 
pairs.  Supporting surrogate pairs is the purview of variable width 
encodings, and UCS-2 is not among them.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread M.-A. Lemburg

Nicholas Bastin wrote:
> On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote:
>> With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start
>> supporting the full Unicode ccs the same way it supports UCS-2.
>> Individual surrogate values remain accessible, and supporting
>> non-BMP characters is left to the application (with the exception
>> of the UTF-8 codec).
>
> I can't understand what you mean by this.  My point is that if you
> configure python to support UCS-2, then it SHOULD NOT support surrogate
> pairs.  Supporting surrogate pairs is the purview of variable width
> encodings, and UCS-2 is not among them.

Surrogate pairs are only supported by the UTF-8 and UTF-16
codecs (and a few others), not the Python Unicode
implementation itself - this treats surrogate code
points just like any other Unicode code point.

This allows us to be flexible and efficient in the implementation
while guaranteeing the round-trip safety of Unicode data processed
through Python.

Your complaint about the documentation (which started this
thread) is valid.

However, I don't understand all the excitement
about Py_UNICODE: if you don't like the way this Python
typedef works, you are free to interface to Python using
any of the supported encodings using PyUnicode_Encode()
and PyUnicode_Decode(). I'm sure you'll find one that
fits your needs and if not, you can even write your
own codec and register it with Python, e.g. UTF-32
which we currently don't support ;-)
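
The same "go through a codec" approach is available from Python code. The sketch below checks that a string survives a round trip through several encodings. It assumes a current Python 3, which, unlike the interpreter discussed in this thread, ships UTF-32 codecs. ENCODINGS and round_trips are hypothetical names.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def round_trips(text, encoding):
    """True if encoding and then decoding `text` gives back the same string."""
    return text.encode(encoding).decode(encoding) == text


sample = "A\u00e9\U00012345"  # ASCII, Latin-1 range, and a non-BMP character
results = {encoding: round_trips(sample, encoding) for encoding in ENCODINGS}
```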

Please upload your doc-patch to SF.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread Nicholas Bastin


On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:

 However, I don't understand all the excitement
 about Py_UNICODE: if you don't like the way this Python
 typedef works, you are free to interface to Python using
 any of the supported encodings using PyUnicode_Encode()
 and PyUnicode_Decode(). I'm sure you'll find one that
 fits your needs and if not, you can even write your
 own codec and register it with Python, e.g. UTF-32
 which we currently don't support ;-)

My concerns about Py_UNICODE are completely separate from my 
frustration that the documentation is wrong about this type.  It is 
much more important that the documentation be correct, first, and then 
we can discuss the reasons why it can be one of two values, rather than 
just a uniform value across all python implementations.  This makes 
distributing binary extension modules hard.  It has become clear to me 
that no one on this list gives a *%^ about people attempting to 
distribute binary extension modules, or they would have cared about 
this problem, so I'll just drop that point.

However, somehow, what keeps getting lost in the mix is that 
--enable-unicode=ucs2 is a lie, and we should change what this 
configure option says.  Martin seems to disagree with me, for reasons 
that I don't understand.  I would be fine with calling the option 
utf16, or just 2 and 4, but not ucs2, as that means things that Python 
doesn't intend it to mean.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-07 Thread M.-A. Lemburg

Nicholas Bastin wrote:
> On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
>
>> However, I don't understand all the excitement
>> about Py_UNICODE: if you don't like the way this Python
>> typedef works, you are free to interface to Python using
>> any of the supported encodings using PyUnicode_Encode()
>> and PyUnicode_Decode(). I'm sure you'll find one that
>> fits your needs and if not, you can even write your
>> own codec and register it with Python, e.g. UTF-32
>> which we currently don't support ;-)
>
> My concerns about Py_UNICODE are completely separate from my
> frustration that the documentation is wrong about this type.  It is
> much more important that the documentation be correct, first, and then
> we can discuss the reasons why it can be one of two values, rather than
> just a uniform value across all python implementations.  This makes
> distributing binary extension modules hard.  It has become clear to me
> that no one on this list gives a *%^ about people attempting to
> distribute binary extension modules, or they would have cared about
> this problem, so I'll just drop that point.

Actually, many of us know about the problem of having to
ship UCS2 and UCS4 builds of binary extensions and the
troubles this causes with users.

It just adds one more dimension to the number of builds
you have to make - one for the Python version, another
for the platform and in the case of Linux another one for
the Unicode width. Nowadays most Linux distros ship UCS4
builds (after RedHat started this quest), so things start
to normalize again.

 However, somehow, what keeps getting lost in the mix is that 
 --enable-unicode=ucs2 is a lie, and we should change what this 
 configure option says.  Martin seems to disagree with me, for reasons 
 that I don't understand.  I would be fine with calling the option 
 utf16, or just 2 and 4, but not ucs2, as that means things that Python 
 doesn't intend it to mean.

It's not a lie: the Unicode implementation does work with
UCS2 code points (surrogate values are Unicode code points as
well - they happen to live in a special zone of the BMP).

Only the codecs add support for surrogates in a way that
allows round-trip safety regardless of whether you used UCS2
or UCS4 as compile time option.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread M.-A. Lemburg

Nicholas Bastin wrote:
> On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
>
>> Nicholas Bastin wrote:
>>
>>> This type represents the storage type which is used by Python
>>> internally as the basis for holding Unicode ordinals.  Extension module
>>> developers should make no assumptions about the size of this type on
>>> any given platform.
>>
>> But people want to know "Is Python's Unicode 16-bit or 32-bit?"
>> So the documentation should explicitly say "it depends."
>>
>> On a related note, it would help if the documentation provided a
>> little more background on unicode encoding.  Specifically, that UCS-2 is
>> not the same as UTF-16, even though they're both two bytes wide and most
>> of the characters are the same.  UTF-16 can encode 4 byte characters,
>> while UCS-2 can't.  A Py_UNICODE is either UCS-2 or UCS-4.  It took me
>
> I'm not sure the Python documentation is the place to teach someone
> about unicode.  The ISO 10646 pretty clearly defines UCS-2 as only
> containing characters in the BMP (plane zero).  On the other hand, I
> don't know why python lets you choose UCS-2 anyhow, since it's almost
> always not what you want.

You've got that wrong: Python lets you choose UCS-4 -
UCS-2 is the default.

Note that Python's Unicode codecs UTF-8 and UTF-16
are surrogate aware and thus support non-BMP code points
regardless of the build type: A UCS2-build of Python will
store a non-BMP code point as UTF-16 surrogate pair in the
Py_UNICODE buffer while a UCS4 build will store it as a
single value. Decoding is surrogate aware too, so a UTF-16
surrogate pair in a UCS2 build will get treated as single
Unicode code point.
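
A rough sketch of that surrogate arithmetic, written in present-day Python 3 rather than taken from this thread. The helper names to_surrogates and from_surrogates are made up for illustration; the constants are the standard UTF-16 ones.

```python
def to_surrogates(cp):
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it stands for."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = to_surrogates(cp)
print(hex(high), hex(low))
print(from_surrogates(high, low) == cp)
# The codec emits the same two 16-bit units (big-endian here).
print(chr(cp).encode("utf-16-be").hex())
```

A 16-bit buffer would hold the two values returned by to_surrogates, while a 32-bit buffer would hold cp itself as a single element.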

Ideally, the Python programmer should not really need to
know all this and I think we've achieved that up to certain
point (Unicode can be complicated - there's nothing to hide there).
However, the C programmer using the Python C API to interface
to some other Unicode implementation will need to know these
details.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread M.-A. Lemburg

Fredrik Lundh wrote:
 Thomas Heller wrote:
 
 
AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars,
independent of the size of wchar_t.  The HAVE_USABLE_WCHAR_T macro
can be used by extension writers to determine if Py_UNICODE is the same as
wchar_t.
 
 
 note that "usable" is more than just "same size"; it also implies that
 widechar predicates (iswalnum etc) work properly with Unicode characters,
 under all locales.

Only if you intend to use --with-wctypes; a configure option which
will go away soon (for exactly the reason you are referring to: the
widechar predicates don't work properly under all locales).

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread M.-A. Lemburg

Nicholas Bastin wrote:
 On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote:
 
 
Nicholas Bastin wrote:

This type represents the storage type which is used by Python
internally as the basis for holding Unicode ordinals.  Extension 
module
developers should make no assumptions about the size of this type on
any given platform.

But people want to know "Is Python's Unicode 16-bit or 32-bit?"
So the documentation should explicitly say "it depends".
 
 
 The important piece of information is that it is not guaranteed to be a 
 particular one of those sizes.  Once you can't guarantee the size, no 
 one really cares what size it is.  The documentation should discourage 
 developers from attempting to manipulate Py_UNICODE directly, which, 
 other than trivia, is the only reason why someone would care what size 
 the internal representation is.

I don't see why you shouldn't use Py_UNICODE buffer directly.
After all, the reason why we have that typedef is to make it
possible to program against an abstract type - regardless of
its size on the given platform.

In that respect it is similar to wchar_t (and all the other
*_t typedefs in C).

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 3:25 AM, M.-A. Lemburg wrote:

 I don't see why you shouldn't use Py_UNICODE buffer directly.
 After all, the reason why we have that typedef is to make it
 possible to program against an abstract type - regardless of
 its size on the given platform.

Because the encoding of that buffer appears to be different depending 
on the configure options.  If that isn't true, then someone needs to 
change the doc, and the configure options.  Right now, it seems *very* 
clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read 
the configure help, and you can't use the buffer directly if the 
encoding is variable.  However, you seem to be saying that this isn't 
true.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:

 You've got that wrong: Python lets you choose UCS-4 -
 UCS-2 is the default.

No, that's not true.  Python lets you choose UCS-4 or UCS-2.  What the 
default is depends on your platform.  If you run raw configure, some 
systems will choose UCS-4, and some will choose UCS-2.  This is how the 
conversation came about in the first place - running ./configure on 
RHL9 gives you UCS-4.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread James Y Knight

On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
 If this is the case, then we're clearly misleading users.  If the
 configure script says UCS-2, then as a user I would assume that
 surrogate pairs would *not* be encoded, because I chose UCS-2, and it
 doesn't support that.  I would assume that any UTF-16 string I would
 read would be transcoded into the internal type (UCS-2), and
 information would be lost.  If this is not the case, then what does the
 configure option mean?

It means all the string operations treat strings as if they were UCS-2, 
but that in actuality, they are UTF-16. Same as the case in the windows 
APIs and Java. That is, all string operations are essentially broken, 
because they're operating on encoded bytes, not characters, but claim 
to be operating on characters.
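
A small sketch of the mismatch being described, assuming a modern Python 3 interpreter where len() counts code points. The helper utf16_length is a made-up name that counts 16-bit code units, which is roughly what a 16-bit build would report.

```python
def utf16_length(text):
    """Length in 16-bit code units rather than in code points."""
    # Each UTF-16 code unit is two bytes; no BOM is emitted for utf-16-le.
    return len(text.encode("utf-16-le")) // 2


s = "a\U00010000b"  # one BMP letter, one non-BMP code point, one BMP letter
print(len(s))           # counted as code points
print(utf16_length(s))  # counted as UTF-16 code units
```

The two counts agree for BMP-only text and drift apart by one for every code point that needs a surrogate pair.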

James


Re: [Python-Dev] New Py_UNICODE doc (Another Attempt)

2005-05-06 Thread Nicholas Bastin

After reading through the code and the comments in this thread, I 
propose the following in the documentation as the definition of 
Py_UNICODE:

This type represents the storage type which is used by Python 
internally as the basis for holding Unicode ordinals.  Extension module 
developers should make no assumptions about the size or native encoding 
of this type on any given platform.

The main point here is that extension developers can not safely slam 
Py_UNICODE (which it appeared was true when the documentation stated 
that it was always 16-bits).

I don't propose that we put this information in the doc, but the 
possible internal representations are:

2-byte wchar_t or unsigned short encoded as UTF-16
4-byte wchar_t encoded as UTF-32 (UCS-4)

If you do not explicitly set the configure option, you cannot guarantee 
which you will get.  Python also does not normalize the byte order of 
unicode strings passed into it from C (via PyUnicode_EncodeUTF16, for 
example), so it is possible to have UTF-16LE and UTF-16BE strings in 
the system at the same time, which is a bit confusing.  This may or may 
not be worth a mention in the doc (or a patch).
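
To illustrate why byte order matters once 16-bit units are viewed as raw bytes, here is a hedged Python 3 sketch. It does not inspect the interpreter's internal buffer; it only shows the same code unit serialized both ways. The helper utf16_units is a hypothetical name.

```python
import struct


def utf16_units(data, byteorder):
    """Interpret raw bytes as 16-bit units in the given byte order."""
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack(prefix + "H" * (len(data) // 2), data))


text = "\u20ac"  # EURO SIGN, a single BMP code point
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le.hex(), be.hex())

# Reading little-endian bytes with the wrong byte order yields another value.
print([hex(u) for u in utf16_units(le, "little")])
print([hex(u) for u in utf16_units(le, "big")])
```

Code that copies such bytes between libraries has to agree on the byte order, or carry a BOM, for the units to mean the same thing on both sides.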

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 3:42 PM, James Y Knight wrote:

 On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
 If this is the case, then we're clearly misleading users.  If the
 configure script says UCS-2, then as a user I would assume that
 surrogate pairs would *not* be encoded, because I chose UCS-2, and it
 doesn't support that.  I would assume that any UTF-16 string I would
 read would be transcoded into the internal type (UCS-2), and
 information would be lost.  If this is not the case, then what does 
 the
 configure option mean?

 It means all the string operations treat strings as if they were 
 UCS-2, but that in actuality, they are UTF-16. Same as the case in the 
 windows APIs and Java. That is, all string operations are essentially 
 broken, because they're operating on encoded bytes, not characters, 
 but claim to be operating on characters.

Well, this is a completely separate issue/problem. The internal 
representation is UTF-16, and should be stated as such.  If the 
built-in methods actually don't work with surrogate pairs, then that 
should be fixed.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Shane Hathaway

Nicholas Bastin wrote:
 On May 6, 2005, at 3:42 PM, James Y Knight wrote:
It means all the string operations treat strings as if they were 
UCS-2, but that in actuality, they are UTF-16. Same as the case in the 
windows APIs and Java. That is, all string operations are essentially 
broken, because they're operating on encoded bytes, not characters, 
but claim to be operating on characters.
 
 
 Well, this is a completely separate issue/problem. The internal 
 representation is UTF-16, and should be stated as such.  If the 
 built-in methods actually don't work with surrogate pairs, then that 
 should be fixed.

Wait... are you saying a Py_UNICODE array contains either UTF-16 or
UTF-32 characters, but never UCS-2?  That's a big surprise to me.  I may
need to change my PyXPCOM patch to fit this new understanding.  I tried
hard to not care how Python encodes unicode characters, but details like
this are important when combining two frameworks with different unicode
APIs.

Shane

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:

 Nicholas Bastin wrote:
 On May 6, 2005, at 3:42 PM, James Y Knight wrote:
 It means all the string operations treat strings as if they were
 UCS-2, but that in actuality, they are UTF-16. Same as the case in 
 the
 windows APIs and Java. That is, all string operations are essentially
 broken, because they're operating on encoded bytes, not characters,
 but claim to be operating on characters.


 Well, this is a completely separate issue/problem. The internal
 representation is UTF-16, and should be stated as such.  If the
 built-in methods actually don't work with surrogate pairs, then that
 should be fixed.

 Wait... are you saying a Py_UNICODE array contains either UTF-16 or
 UTF-32 characters, but never UCS-2?  That's a big surprise to me.  I 
 may
 need to change my PyXPCOM patch to fit this new understanding.  I tried
 hard to not care how Python encodes unicode characters, but details 
 like
 this are important when combining two frameworks with different unicode
 APIs.

Yes.  Well, in as much as a large part of UTF-16 directly overlaps 
UCS-2, then sometimes unicode strings contain UCS-2 characters.  
However, characters which would not be legal in UCS-2 are still encoded 
properly in python, in UTF-16.

And yes, I feel your pain, that's how I *got* into this position.  
Mapping from external unicode types is an important aspect of writing 
extension modules, and the documentation does not help people trying to 
do this.  The fact that python's internal encoding is variable is a 
huge problem in and of itself, even if that was documented properly.  
This is why tools like Xerces and ICU will be happy to give you 
whatever form of unicode strings you want, but internally they always 
use UTF-16 - to avoid having to write two internal implementations of 
the same functionality.  If you look up and down 
Objects/unicodeobject.c you'll see a fair amount of code written a 
couple of different ways (using #ifdef's) because of the variability in 
the internal representation.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Shane Hathaway

Nicholas Bastin wrote:
 
 On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
 Wait... are you saying a Py_UNICODE array contains either UTF-16 or
 UTF-32 characters, but never UCS-2?  That's a big surprise to me.  I may
 need to change my PyXPCOM patch to fit this new understanding.  I tried
 hard to not care how Python encodes unicode characters, but details like
 this are important when combining two frameworks with different unicode
 APIs.
 
 
 Yes.  Well, in as much as a large part of UTF-16 directly overlaps
 UCS-2, then sometimes unicode strings contain UCS-2 characters. 
 However, characters which would not be legal in UCS-2 are still encoded
 properly in python, in UTF-16.
 
 And yes, I feel your pain, that's how I *got* into this position. 
 Mapping from external unicode types is an important aspect of writing
 extension modules, and the documentation does not help people trying to
 do this.  The fact that python's internal encoding is variable is a huge
 problem in and of itself, even if that was documented properly.  This is
 why tools like Xerces and ICU will be happy to give you whatever form of
 unicode strings you want, but internally they always use UTF-16 - to
 avoid having to write two internal implementations of the same
 functionality.  If you look up and down Objects/unicodeobject.c you'll
 see a fair amount of code written a couple of different ways (using
 #ifdef's) because of the variability in the internal representation.

Ok.  Thanks for helping me understand where Python is WRT unicode.  I
can work around the issues (or maybe try to help solve them) now that I
know the current state of affairs.  If Python correctly handled UTF-16
strings internally, we wouldn't need the UCS-4 configuration switch,
would we?

Shane

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

Nicholas Bastin wrote:
 The important piece of information is that it is not guaranteed to be a
 particular one of those sizes.  Once you can't guarantee the size, no
 one really cares what size it is.

Please trust many years of experience: This is just not true. People
do care, and they want to know. If we tell them "it depends", they
ask "how can I find out".
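
One way to answer that from Python code is to look at sys.maxunicode, sketched below. The function name unicode_build and its labels are illustrative only. On Python 3.3 and later the value is always 0x10FFFF, so the 16-bit branch can only be reached on older interpreters or by passing a value explicitly.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter by the largest code point one unit can hold."""
    if maxunicode == 0xFFFF:
        return "ucs2 (16-bit units)"
    return "ucs4 (32-bit units)"


print(hex(sys.maxunicode), unicode_build())
```

From C, the corresponding check discussed in this thread is sizeof(Py_UNICODE).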

 The documentation should discourage
 developers from attempting to manipulate Py_UNICODE directly, which,
 other than trivia, is the only reason why someone would care what size
 the internal representation is.

Why is that? Of *course* people will have to manipulate Py_UNICODE*
buffers directly. What else can they use?

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

Nicholas Bastin wrote:
 If this is the case, then we're clearly misleading users.  If the
 configure script says UCS-2, then as a user I would assume that
 surrogate pairs would *not* be encoded, because I chose UCS-2, and it
 doesn't support that.

What do you mean by that? That the interpreter crashes if you try
to store a low surrogate into a Py_UNICODE?

 I would assume that any UTF-16 string I would
 read would be transcoded into the internal type (UCS-2), and information
 would be lost.  If this is not the case, then what does the configure
 option mean?

It tells you whether you have the two-octet form of the Universal
Character Set, or the four-octet form.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

Nicholas Bastin wrote:
 Because the encoding of that buffer appears to be different depending on
 the configure options.

What makes it appear so? sizeof(Py_UNICODE) changes when you change
the option - does that, in your mind, mean that the encoding changes?

 If that isn't true, then someone needs to change
 the doc, and the configure options.  Right now, it seems *very* clear
 that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the
 configure help, and you can't use the buffer directly if the encoding is
 variable.  However, you seem to be saying that this isn't true.

It's a compile-time option (as all configure options). So at run-time,
it isn't variable.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

Nicholas Bastin wrote:
 No, that's not true.  Python lets you choose UCS-4 or UCS-2.  What the
 default is depends on your platform.

The truth is more complicated. If your Tcl is built for UCS-4, then
Python will also be built for UCS-4 (unless overridden by command line).
Otherwise, Python will default to UCS-2.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

M.-A. Lemburg wrote:
 Hmm, looking at the configure.in script, it seems you're right.
 I wonder why this weird dependency on TCL was added.

If Python is configured for UCS-2, and Tcl for UCS-4, then
Tkinter would not work out of the box. Hence the weird dependency.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote:

 Nicholas Bastin wrote:
 If this is the case, then we're clearly misleading users.  If the
 configure script says UCS-2, then as a user I would assume that
 surrogate pairs would *not* be encoded, because I chose UCS-2, and it
 doesn't support that.

 What do you mean by that? That the interpreter crashes if you try
 to store a low surrogate into a Py_UNICODE?

What I mean is pretty clear.  UCS-2 does *NOT* support surrogate pairs.
If it did, it would be called UTF-16.  If Python really supported
UCS-2, then surrogate pairs from UTF-16 inputs would either get turned
into two garbage characters, or the "I couldn't transcode this" UCS-2
code point (I don't remember which one that is off the top of my head).

 I would assume that any UTF-16 string I would
 read would be transcoded into the internal type (UCS-2), and 
 information
 would be lost.  If this is not the case, then what does the configure
 option mean?

 It tells you whether you have the two-octet form of the Universal
 Character Set, or the four-octet form.

It would, if that were the case, but it's not.  Setting UCS-2 in the 
configure script really means UTF-16, and as such, the documentation 
should reflect that.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 7:45 PM, Martin v. Löwis wrote:

 Nicholas Bastin wrote:
 Because the encoding of that buffer appears to be different depending 
 on
 the configure options.

 What makes it appear so? sizeof(Py_UNICODE) changes when you change
 the option - does that, in your mind, mean that the encoding changes?

Yes.  Not only in my mind, but in the Python source code.  If 
Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4), 
otherwise the encoding is UTF-16 (*not* UCS-2).

 If that isn't true, then someone needs to change
 the doc, and the configure options.  Right now, it seems *very* clear
 that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the
 configure help, and you can't use the buffer directly if the encoding 
 is
 variable.  However, you seem to be saying that this isn't true.

 It's a compile-time option (as all configure options). So at run-time,
 it isn't variable.

What I mean by 'variable' is that you can't make any assumption as to 
what the size will be in any given python when you're writing (and 
building) an extension module.  This breaks binary compatibility of 
extensions modules on the same platform and same version of python 
across interpreters which may have been built with different configure 
options.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

Shane Hathaway wrote:
 Ok.  Thanks for helping me understand where Python is WRT unicode.  I
 can work around the issues (or maybe try to help solve them) now that I
 know the current state of affairs.  If Python correctly handled UTF-16
 strings internally, we wouldn't need the UCS-4 configuration switch,
 would we?

Define "correctly". Python, in ucs2 mode, will allow you to address
individual surrogate codes, e.g. in indexing. So you get

>>> u"\U00012345"[0]
u'\ud808'

This will never work correctly, and never should, because an efficient
implementation isn't possible. If you want safe indexing and slicing,
you need ucs4.
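
A sketch of why code-point indexing over 16-bit units cannot be a constant-time lookup, written for Python 3 and assuming well-formed UTF-16 input. The names code_units and code_point_at are hypothetical helpers, not part of any Python API.

```python
import struct


def code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_at(units, index):
    """Return the index-th code point; this has to walk from the start."""
    pos = 0
    for _ in range(index):
        # A high surrogate means this code point occupies two units.
        pos += 2 if 0xD800 <= units[pos] <= 0xDBFF else 1
    unit = units[pos]
    if 0xD800 <= unit <= 0xDBFF:
        low = units[pos + 1]
        return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit


units = code_units("\U00012345x")
print(hex(units[0]))                  # plain unit indexing: a lone surrogate
print(hex(code_point_at(units, 0)))   # surrogate-aware indexing
print(chr(code_point_at(units, 1)))
```

Plain indexing is a single array access, while the surrogate-aware version scans every unit before the requested position. With 32-bit units both cases are a single access.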

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Martin v. Löwis

Nicholas Bastin wrote:
 What I mean is pretty clear.  UCS-2 does *NOT* support surrogate pairs.
 If it did, it would be called UTF-16.  If Python really supported
 UCS-2, then surrogate pairs from UTF-16 inputs would either get turned
 into two garbage characters, or the "I couldn't transcode this" UCS-2
 code point (I don't remember which one that is off the top of my head).

OTOH, if Python really supported UTF-16, then unichr(0x10000) would
work, and len(u"\U00010000") would be 1.

It is primarily just the UTF-8 codec which supports UTF-16.

Regards,
Martin

Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote:

 Nicholas Bastin wrote:
 Yes.  Not only in my mind, but in the Python source code.  If
 Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4),
 otherwise the encoding is UTF-16 (*not* UCS-2).

 I see. Some people equate encoding with encoding scheme;
 neither UTF-32 nor UTF-16 is an encoding scheme. You were

That's not true.  UTF-16 and UTF-32 are both CES and CEF (although this 
is not true of UTF-16LE and BE).  UTF-32 is a fixed-width encoding form 
within a code space of (0..10FFFF) and UTF-16 is a variable-width 
encoding form which provides a mix of one or two 16-bit code units in 
the code space of (0..FFFF).  However, you are perhaps right to point 
out that people should be more explicit as to which they are referring 
to.  UCS-2, however, is only a CEF, and thus I thought it was obvious 
that I was referring to UTF-16 as a CEF.  I would point anyone who is 
confused at this point to Unicode Technical Report #17 on the Character 
Encoding Model, which is much more clear than trying to piece together 
the relevant parts out of the entire standard.

In any event, Python's use of the term UCS-2 is incorrect.  I quote 
from the TR:

"The UCS-2 encoding form, which is associated with ISO/IEC 10646 and 
can only express characters in the BMP, is a fixed-width encoding 
form."

immediately followed by:

"In contrast, UTF-16 uses either one or two code units and is able to 
cover the entire code space of Unicode."

If Python is capable of representing the entire code space of Unicode 
when you choose --enable-unicode=ucs2, then that is a bug.  It either should 
not be called UCS-2, or the interpreter should be bound by the 
limitations of the UCS-2 CEF.
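
For contrast, an encoder that really was bound by the UCS-2 limitation might look like the following Python 3 sketch. The function encode_ucs2 is a made-up name, big-endian output is an arbitrary choice, and lone surrogate code points are not treated specially here.

```python
def encode_ucs2(text):
    """Encode as strict big-endian UCS-2; refuse anything outside the BMP."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has exactly one 16-bit unit per character, so no pairs.
            raise ValueError("U+%04X is not representable in UCS-2" % cp)
        out += cp.to_bytes(2, "big")
    return bytes(out)


print(encode_ucs2("A\u20ac").hex())
try:
    encode_ucs2("\U00010000")
except ValueError as exc:
    print("rejected:", exc)
```

For BMP-only text the result is byte-for-byte what the utf-16-be codec produces. The two only diverge once a code point above U+FFFF shows up, which is the distinction being argued about here.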


 What I mean by 'variable' is that you can't make any assumption as to
 what the size will be in any given python when you're writing (and
 building) an extension module.  This breaks binary compatibility of
 extensions modules on the same platform and same version of python
 across interpreters which may have been built with different configure
 options.

 True. The breakage will be quite obvious, in most cases: the module
 fails to load because not only sizeof(Py_UNICODE) changes, but also
 the names of all symbols change.

Yes, but the important question here is why would we want that?  Why 
doesn't Python just have *one* internal representation of a Unicode 
character?  Having more than one possible definition just creates 
problems, and provides no value.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-06 Thread Nicholas Bastin


On May 6, 2005, at 8:11 PM, Martin v. Löwis wrote:

 Nicholas Bastin wrote:
 Well, this is a completely separate issue/problem. The internal
 representation is UTF-16, and should be stated as such.  If the
 built-in methods actually don't work with surrogate pairs, then that
 should be fixed.

 Yes to the former, no to the latter. PEP 261 specifies what should
 and shouldn't work.

This PEP has several textual errors and ambiguities (which, admittedly, 
may have been a necessary state given the unicode standard in 2001).  
However, putting that aside, I would recommend that:

--enable-unicode=ucs2

be replaced with:

--enable-unicode=utf16

and the docs be updated to reflect more accurately the variance of the 
internal storage type.

I would also like the community to strongly consider standardizing on a 
single internal representation, but I will leave that fight for another 
day.

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-05 Thread Nicholas Bastin


On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote:

 Nicholas Bastin wrote:
 This type represents the storage type which is used by Python
 internally as the basis for holding Unicode ordinals.  Extension 
 module
 developers should make no assumptions about the size of this type on
 any given platform.

 But people want to know "Is Python's Unicode 16-bit or 32-bit?"
 So the documentation should explicitly say "it depends".

The important piece of information is that it is not guaranteed to be a 
particular one of those sizes.  Once you can't guarantee the size, no 
one really cares what size it is.  The documentation should discourage 
developers from attempting to manipulate Py_UNICODE directly, which, 
other than trivia, is the only reason why someone would care what size 
the internal representation is.

--
Nick

Re: [Python-Dev] New Py_UNICODE doc

2005-05-05 Thread Shane Hathaway

Nicholas Bastin wrote:
 
 On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
 On a related note, it would be helpful if the documentation provided a
 little more background on unicode encoding.  Specifically, that UCS-2 is
 not the same as UTF-16, even though they're both two bytes wide and most
 of the characters are the same.  UTF-16 can encode 4 byte characters,
 while UCS-2 can't.  A Py_UNICODE is either UCS-2 or UCS-4.  It took me
 
 
 I'm not sure the Python documentation is the place to teach someone
 about unicode.  The ISO 10646 pretty clearly defines UCS-2 as only
 containing characters in the BMP (plane zero).  On the other hand, I
 don't know why python lets you choose UCS-2 anyhow, since it's almost
 always not what you want.

Then something in the Python docs ought to say why UCS-2 is not what you
want.  I still don't know; I've heard differing opinions on the subject.
 Some say you'll never need more than what UCS-2 provides.  Is that
incorrect?

More generally, how should a non-unicode-expert writing Python extension
code find out the minimum they need to know about unicode to use the
Python unicode API?  The API reference [1] ought to at least have a list
of background links.  I had to hunt everywhere.

.. [1] http://docs.python.org/api/unicodeObjects.html

Shane

Re: [Python-Dev] New Py_UNICODE doc

2005-05-04 Thread Michael Hudson

Nicholas Bastin writes:

 The current documentation for Py_UNICODE states:

 This type represents a 16-bit unsigned storage type which is used by  
 Python internally as basis for holding Unicode ordinals. On  platforms 
 where wchar_t is available and also has 16-bits,  Py_UNICODE is a 
 typedef alias for wchar_t to enhance  native platform compatibility. On 
 all other platforms,  Py_UNICODE is a typedef alias for unsigned 
 short.

 I propose changing this to:

 This type represents the storage type which is used by Python 
 internally as the basis for holding Unicode ordinals.  On platforms 
 where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t 
 to enhance native platform compatibility.

This just isn't true.  Have you read ./configure --help recently?

 On all other platforms, Py_UNICODE is a typedef alias for unsigned
 short.  Extension module developers should make no assumptions about
 the size of this type on any given platform.

I like this last sentence, though.

 If no one has a problem with that, I'll make the change in CVS.

I have a problem with replacing one lie with another :)

Cheers,
mwh

-- 
  Just put the user directories on a 486 with deadrat7.1 and turn the
  Octane into the aforementioned beer fridge and keep it in your
  office. The lusers won't notice the difference, except that you're
  more cheery during office hours.  -- Pim van Riezen, asr

Re: [Python-Dev] New Py_UNICODE doc

2005-05-04 Thread Nicholas Bastin


On May 4, 2005, at 1:02 PM, Michael Hudson wrote:

 Nicholas Bastin writes:

 The current documentation for Py_UNICODE states:

 This type represents a 16-bit unsigned storage type which is used by
 Python internally as basis for holding Unicode ordinals. On platforms
 where wchar_t is available and also has 16-bits, Py_UNICODE is a
 typedef alias for wchar_t to enhance native platform compatibility. On
 all other platforms, Py_UNICODE is a typedef alias for unsigned
 short.

 I propose changing this to:

 This type represents the storage type which is used by Python
 internally as the basis for holding Unicode ordinals.  On platforms
 where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t
 to enhance native platform compatibility.

 This just isn't true.  Have you read ./configure --help recently?

Ok, so the above statement is true if the user does not set
--enable-unicode=ucs[24] (I was reading the wchar_t test in configure.in,
not the generated configure help).

Alternatively, we shouldn't talk about the size at all, and just leave 
the first and last sentences:

This type represents the storage type which is used by Python 
internally as the basis for holding Unicode ordinals.  Extension module 
developers should make no assumptions about the size of this type on 
any given platform.
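
The "make no assumptions" advice can be illustrated from the Python side with a small sketch. It queries the size of the platform's wchar_t at run time through the standard `ctypes` module instead of hard-coding 16 or 32 bits. Note the assumption: this reports the C compiler's wchar_t only; as discussed in this thread, Py_UNICODE may be configured independently of it. The function name `wchar_size_bits` is hypothetical.

```python
import ctypes


def wchar_size_bits() -> int:
    """Return the width in bits of the platform's wchar_t type."""
    return ctypes.sizeof(ctypes.c_wchar) * 8


# Typical results are 16 (e.g. Windows) or 32 (e.g. most Unix-like systems),
# which is exactly why code should query the size instead of assuming it.
print(wchar_size_bits())
```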

--
Nick


Re: [Python-Dev] New Py_UNICODE doc

2005-05-04 Thread Fredrik Lundh

Thomas Heller wrote:

 AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars,
 independent of the size of wchar_t.  The HAVE_USABLE_WCHAR_T macro
 can be used by extension writers to determine if Py_UNICODE is the same as
 wchar_t.

note that "usable" is more than just "same size"; it also implies that widechar
predicates (iswalnum etc) work properly with Unicode characters, under all
locales.

/F




Re: [Python-Dev] New Py_UNICODE doc

2005-05-04 Thread Martin v. Löwis

Nicholas Bastin wrote:
 This type represents the storage type which is used by Python 
 internally as the basis for holding Unicode ordinals.  Extension module 
 developers should make no assumptions about the size of this type on 
 any given platform.

But people want to know "Is Python's Unicode 16-bit or 32-bit?"
So the documentation should explicitly say "it depends".
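
One way to see the "it depends" answer from Python code is `sys.maxunicode`, sketched below. On builds configured with --enable-unicode=ucs2 it is 0xFFFF, and on ucs4 builds it is 0x10FFFF; on CPython 3.3 and later it is always 0x10FFFF, so the narrow branch only applies to older interpreters. The function name `unicode_build_width` and its return strings are hypothetical, chosen for illustration.

```python
import sys


def unicode_build_width(maxunicode: int = sys.maxunicode) -> str:
    """Classify an interpreter build from its sys.maxunicode value."""
    if maxunicode == 0xFFFF:
        return "narrow (16-bit code units)"
    if maxunicode == 0x10FFFF:
        return "wide (full code point range)"
    return "unknown"


print(unicode_build_width())
```

Taking the value as a parameter keeps the check testable without needing two different interpreter builds.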

Regards,
Martin
