Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Stefan Behnel

Stefan Behnel, 25.08.2011 23:30:

Stefan Behnel, 25.08.2011 20:47:

"Martin v. Löwis", 24.08.2011 20:15:

- issues to be considered (unclarities, bugs, limitations, ...)


A problem of the current implementation is the need for calling
PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to
insufficient memory). Basically, this means that even something as trivial
as trying to get the length of a Unicode string can now result in an error.


Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there
is *any* code out there that expects this macro to ever return NULL. This
means that the current implementation has actually broken the old API. Just
allocate an "80% of your memory" long string using the new API and then
call PyUnicode_AS_UNICODE() on it to see what I mean.

Sadly, a quick look at a couple of recent commits in the pep-393 branch
suggested that it is not even always obvious to you as the authors which
macros can be called safely and which cannot. I immediately spotted a bug
in one of the updated core functions (unicode_repr, IIRC) where
PyUnicode_GET_LENGTH() is called without a previous call to
PyUnicode_FAST_READY().

I find it anything but obvious that calling PyUnicode_DATA() and
PyUnicode_KIND() is safe as long as the return value is checked for
errors, while calling PyUnicode_GET_LENGTH() is not safe unless there was
a previous call to PyUnicode_Ready().
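
To illustrate, here is a minimal sketch of the calling pattern this seems
to require (assuming, as in the branch, that PyUnicode_READY() returns -1
on failure with an exception set):

    /* Hypothetical helper: even a plain length query now has an
       error path. */
    static Py_ssize_t
    get_length_checked(PyObject *unicode)
    {
        if (PyUnicode_READY(unicode) == -1)
            return -1;   /* e.g. insufficient memory; exception is set */
        return PyUnicode_GET_LENGTH(unicode);  /* safe only after READY */
    }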


And, adding to my own mail yet another time, the current header file states 
this:


"""
/* String contains only wstr byte characters.  This is only possible
   when the string was created with a legacy API and PyUnicode_Ready()
   has not been called yet.  Note that PyUnicode_KIND() calls
   PyUnicode_FAST_READY() so PyUnicode_WCHAR_KIND is only possible as an
   initialized value not as a result of PyUnicode_KIND(). */
#define PyUnicode_WCHAR_KIND 0
"""

From my understanding, this is incorrect. When I call PyUnicode_KIND() on 
an old style object and it fails to allocate the string buffer, I would 
expect that I actually get PyUnicode_WCHAR_KIND back as a result, as the 
SSTATE_KIND_* value in the "state" field has not been initialised yet at 
that point.


Stefan



Re: [Python-Dev] Windows installers and %PATH%

2011-08-25 Thread Nick Coghlan
On Fri, Aug 26, 2011 at 2:04 PM, Andrew Pennebaker
 wrote:
> Please have the Windows installers add the Python installation directory to
> the PATH environment variable.

Please read PEP 397: Python Launcher for Windows.

Or at least do us the courtesy of acknowledging that if the issue was
as simple as "just munge the PATH", it would have been done long ago.
Windows is a developer hostile platform unless you completely buy into
the Microsoft toolchain, which is not an option for cross-platform
projects like Python.

It's well within Microsoft's capabilities to create and support a
POSIX compatibility layer that allows applications to look and feel
like native ones, but they choose not to, since they see
cross-platform development as a competitive threat to their desktop
dominance. There's a reason many open source projects don't offer
native support at all, instead instructing people to use Cygwin as a
compatibility layer.

It irks me greatly when people place the blame for this situation on
volunteer programmers giving them stuff for free instead of where it
belongs (i.e. on the multibillion dollar corporation deliberately
failing to implement a widely recognised OS interoperability
standard).

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Stefan Behnel

Isaac Morland, 26.08.2011 04:28:

On Thu, 25 Aug 2011, Guido van Rossum wrote:

I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair. Basically, I'm for a
garbage-in-garbage-out approach (with separate library functions to
detect garbage if the app is worried about it).


If it's called UTF-8, there is no decision to be taken as to decoder
behaviour - any byte sequence not permitted by the Unicode standard must
result in an error (although, of course, *how* the error is to be reported
could legitimately be the subject of endless discussion). There are
security implications to violating the standard so this isn't just
legalistic purity.

Hmmm, doesn't look good:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode ('utf-8')
u'\udc00'
>>>

Incorrect! Although this is a narrow build - I can't say what the wide
build would do.


Works the same for me in a wide Py2.7 build, but gives me this in Py3:

Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xb0\x80'.decode ('utf-8')
Traceback (most recent call last):
  File "", line 1, in 
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: 
illegal encoding


Same for current Py3.3 and the PEP393 build (although both have a better 
exception message now: "UnicodeDecodeError: 'utf8' codec can't decode bytes 
in position 0-1: invalid continuation byte").


Stefan



Re: [Python-Dev] Windows installers and %PATH%

2011-08-25 Thread John O'Connor
+0 for automatically adding to %PATH%

+1 for providing an option to the user during install

- John


On Thu, Aug 25, 2011 at 9:04 PM, Andrew Pennebaker <
andrew.penneba...@gmail.com> wrote:

> Please have the Windows installers add the Python installation directory to
> the PATH environment variable.
>
> Many newbies dive in without knowing that they must manually add
> C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that
> should have been done by the installers way back in Python 1.0.
>
> Please also add PYTHONROOT\Scripts. It's where cool things like
> easy_install.exe are stored. More yak shaving.
>
> The only potential downside to this is upsetting users who manage multiple
> python installations. It's not a problem: they already manually adjust PATH
> to their liking.
>
> Cheers,
>
> Andrew Pennebaker
> www.yellosoft.us
>


[Python-Dev] Windows installers and %PATH%

2011-08-25 Thread Andrew Pennebaker
Please have the Windows installers add the Python installation directory to
the PATH environment variable.

Many newbies dive in without knowing that they must manually add C:\PythonXY
to PATH. It's yak shaving, something perfectly automatable that should have
been done by the installers way back in Python 1.0.

Please also add PYTHONROOT\Scripts. It's where cool things like
easy_install.exe are stored. More yak shaving.

The only potential downside to this is upsetting users who manage multiple
python installations. It's not a problem: they already manually adjust PATH
to their liking.

Cheers,

Andrew Pennebaker
www.yellosoft.us


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland  wrote:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough. Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, *how* the error is to be reported
> could legitimately be the subject of endless discussion).  There are
> security implications to violating the standard so this isn't just
> legalistic purity.

You have a point. The security issues cannot be seen separate from all
the other issues. The folks inside Google who care about Unicode often
harp on this. So I stand corrected. I am fine with codecs treating
code points or code point sequences that the Unicode standard doesn't
like (e.g. lone surrogates) the same way as more severe errors in the
encoded bytes (lots of byte sequences already aren't valid UTF-8). I
just hope this doesn't require normal forms or other expensive
operations; I hope it's limited to rejecting invalid use of surrogates
or other values that are not valid code points (e.g. 0, or >= 2**21).
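
To be concrete, the check I have in mind is as cheap as this sketch (a
hypothetical helper, not an existing API):

    /* A Unicode scalar value is any code point in 0..0x10FFFF that
       is not a surrogate (0xD800..0xDFFF). */
    static int
    is_unicode_scalar_value(Py_UCS4 ch)
    {
        return ch <= 0x10FFFF && !(0xD800 <= ch && ch <= 0xDFFF);
    }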

> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.

> >>> '\xed\xb0\x80'.decode ('utf-8')
> u'\udc00'
>
> Incorrect!  Although this is a narrow build - I can't say what the wide
> build would do.
>
> For reasons of practicality, it may be appropriate to provide easy access to
> a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be
> called UTF-8.  Other variations may also find use if provided.
>
> See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
>
> And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Thanks for the links! I also like the term "supplemental character" (a
code point >= 2**16). And I note that they talk about characters where
we've just agreed that we should say code points...

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Thu, Aug 25, 2011 at 6:40 PM, Ezio Melotti  wrote:
> On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum  wrote:
>>
>> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy  wrote:
>> > Excuse me for believing the fine 3.2 manual that says
>> > "Strings contain Unicode characters." (And to a naive reader, that
>> > implies
>> > that string iteration and indexing should produce Unicode characters.)
>>
>> The naive reader also doesn't know the difference between characters,
>> code points and code units. It's the advanced, Unicode-aware reader
>> who is confused by this phrase in the docs. It should say code units;
>> or perhaps code units for narrow builds and code points for wide
>> builds.
>
> For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be
> correct.  Also note that:
>   * for both, every "code unit" has a specific "codepoint" (including lone
> surrogates), so it might be OK to talk about "codepoints" too, but
>   * only for wide builds every "codepoint" is represented by a single,
> 32-bit "code unit".  In narrow builds, non-BMP chars are represented by a
> "code unit sequence" of two elements (i.e. a "surrogate pair").

The more I think about it the more it seems to me that the biggest
problem is that in narrow builds it is ambiguous whether (unicode)
strings contain code units, i.e. are *encoded* code points, or whether
they contain (decoded) code points. In a sense this is repeating the
ambiguity of 8-bit strings in Python 2, which are sometimes assumed to
contain ASCII or Latin-1 (i.e., code points with a limited range) or
UTF-8 (i.e., code units).

I know that by now I am repeating myself, but I think it would be
really good if we could get rid of this ambiguity. PEP 393 seems the
best way forward, even if it doesn't directly address what to do for
IronPython or Jython, both of which have to deal with a pervasive
native string type that contains UTF-16.

IIUC, CPython on Windows will work just fine with PEP 393, even if it
means that there is a bit more translation between Python strings and
the OS native wchar_t[] type. I assume that the data volumes going
through the OS APIs are relatively constrained, since data actually
written to or read from a file will still be bytes, possibly run
through a codec (if it's a text file), and not go through one of the
wchar_t[] APIs -- the latter are used for things like filenames, which
are much smaller.

> Since "code unit" refers to the *minimal* bit combination, in UTF-8
> characters that need 2/3/4 bytes are represented with a "code unit
> sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code
> points" overlap only for the ASCII range).

Actually I think UTF-8 is best thought of as an encoding for code
points, not characters -- the subtle difference between these two
should be of no concern to the UTF-8 codec (unless it is a validating
codec).

>> With PEP 393 we can unconditionally say code points, which is
>> much better. We should try to remove our use of "characters" -- or
>> else we should *define* our use of the term "characters" as "what the
>> Unicode standard calls code points".
>
> Character usually works fine, especially for naive readers.  Even
> Unicode-aware readers often confuse the several terms, so using a
> simple term and pointing to a more accurate description sounds like a better
> idea to me.

We may well have no choice -- there is just too much documentation
that naively refers to characters while really referring to code units
or code points.

> Note that there's also another important term[1]:
> """
> Unicode Scalar Value. Any Unicode code point except high-surrogate and
> low-surrogate code points. In other words, the ranges of integers 0 to
> D7FF16 and E00016 to 10FFFF16 inclusive.
> """

This seems to involve validation. I think all validation should be
sequestered to specific APIs (e.g. certain codecs) and the string type
should not care about it. Depending on what they are doing,
applications may have to be aware of many subtleties in order to
always avoid generating "invalid" (or not well-formed -- what's the
difference?) strings.

> For example the UTF codecs produce sequences of "code units" (of 8, 16, 32
> bits) that represent "scalar values"[2][3]:
>
> Chapter 3 [4] says:
> """
> 3.9 Unicode Encoding Forms
> The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]

I really don't mind whether our codecs actually make exceptions for
surrogates (lone or otherwise). The only requirement I care about is
that surrogate-free strings round-trip correctly. Again, apps that
want to conform to the requirements regarding surrogates can implement
their own validation, and certainly at some point we should offer a
validation library as part of the stdlib -- but it should be up to the
app whether and when to use it.

>  D76 Unicode scalar value: Any Unicode code point except high-surrogate
> and low-surrogate code points.

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Isaac Morland

On Thu, 25 Aug 2011, Guido van Rossum wrote:


I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair. Basically, I'm for a
garbage-in-garbage-out approach (with separate library functions to
detect garbage if the app is worried about it).


If it's called UTF-8, there is no decision to be taken as to decoder 
behaviour - any byte sequence not permitted by the Unicode standard must 
result in an error (although, of course, *how* the error is to be reported 
could legitimately be the subject of endless discussion).  There are 
security implications to violating the standard so this isn't just 
legalistic purity.


Hmmm, doesn't look good:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> '\xed\xb0\x80'.decode ('utf-8')
u'\udc00'

Incorrect!  Although this is a narrow build - I can't say what the wide 
build would do.


For reasons of practicality, it may be appropriate to provide easy access 
to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must 
not be called UTF-8.  Other variations may also find use if provided.


See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt

And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Isaac Morland   CSCF Web Guru
DC 2554C, x36650        WWW Software Specialist


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Ezio Melotti
On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum  wrote:

> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy  wrote:
> > Excuse me for believing the fine 3.2 manual that says
> > "Strings contain Unicode characters." (And to a naive reader, that
> implies
> > that string iteration and indexing should produce Unicode characters.)
>
> The naive reader also doesn't know the difference between characters,
> code points and code units. It's the advanced, Unicode-aware reader
> who is confused by this phrase in the docs. It should say code units;
> or perhaps code units for narrow builds and code points for wide
> builds.


For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be
correct.  Also note that:
  * for both, every "code unit" has a specific "codepoint" (including lone
surrogates), so it might be OK to talk about "codepoints" too, but
  * only for wide builds every "codepoint" is represented by a single,
32-bit "code unit".  In narrow builds, non-BMP chars are represented by a
"code unit sequence" of two elements (i.e. a "surrogate pair"; see the
sketch below).

Since "code unit" refers to the *minimal* bit combination, in UTF-8
characters that need 2/3/4 bytes are represented with a "code unit
sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code
points" overlap only for the ASCII range).


> With PEP 393 we can unconditionally say code points, which is
> much better. We should try to remove our use of "characters" -- or
> else we should *define* our use of the term "characters" as "what the
> Unicode standard calls code points".
>

Character usually works fine, especially for naive readers.  Even
Unicode-aware readers often confuse the several terms, so using a
simple term and pointing to a more accurate description sounds like a better
idea to me.

Note that there's also another important term[1]:
"""
*Unicode Scalar Value*. Any Unicode *code point* except high-surrogate and
low-surrogate code points. In other words, the ranges of integers 0 to
D7FF16 and E00016 to 10FFFF16 inclusive.
"""
For example the UTF codecs produce sequences of "code units" (of 8, 16, 32
bits) that represent "scalar values"[2][3]:

Chapter 3 [4] says:
"""
3.9 Unicode Encoding Forms
The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]
 D76 Unicode scalar value: Any Unicode code point except high-surrogate and
low-surrogate code points.
 • As a result of this definition, the set of Unicode scalar values
consists of the ranges 0 to D7FF16 and E00016 to 10FFFF16, inclusive.
 D77 Code unit: The minimal bit combination that can represent a unit of
encoded text for processing or interchange.
[...]
 D79 A Unicode encoding form assigns each Unicode scalar value to a unique
code unit sequence.
"""

On the other hand, Python Unicode strings are not limited to scalar values,
because they can also contain lone surrogates.


I hope this helps clarify the terminology a bit and doesn't add more
confusion, but if we want to use the Unicode terms we should get them
right.  (Also note that I might have misunderstood something, even if I've
been careful with the terms and I double-checked and quoted the relevant
parts of the Unicode standard.)

Best Regards,
Ezio Melotti


[0]: From chapter 3 [4],
 D77 Code unit: The minimal bit combination that can represent a unit of
encoded text for processing or interchange.
   • Code units are particular units of computer storage. Other character
encoding standards typically use code units defined as 8-bit units—that is,
octets.
 The Unicode Standard uses 8-bit code units in the UTF-8 encoding form,
16-bit code units in the UTF-16 encoding form, and 32-bit code units in the
UTF-32 encoding form.
[1]: http://unicode.org/glossary/#unicode_scalar_value
[2]: Apparently Python 3 raises an error while encoding lone surrogates in
UTF-8, but it doesn't for UTF-16 and UTF-32.
From chapter 3 [4],
 D91: "Because surrogate code points are not Unicode scalar values, isolated
UTF-16 code units in the range 0xD800..0xDFFF are ill-formed."
 D92: "Because surrogate code points are not included in the set of Unicode
scalar values, UTF-32 code units in the range 0xD800..0xDFFF are
ill-formed."
I think this should be fixed.
[3]: Note that I'm talking about codecs used to encode/decode Unicode
strings to/from bytes here, it's perfectly fine for Python itself to
represent lone surrogates in its *internal* representations, regardless of
what encoding it's using.
[4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Thu, Aug 25, 2011 at 4:58 AM, Stephen J. Turnbull  wrote:
> The problem with your legalistic approach, as I see it, is that if our
> definition is looser than the users', all their surprises will be
> unpleasant.  That's not good.

I see no alternative to explicitly spelling out what all operations do
and let the user figure out whether that meets their needs. E.g. we
needn't say that the str type or its == operator conforms to the
Unicode standard. We just need to say that the string type is a
sequence of code points, that string operations don't do validation or
normalization, and that to do a comparison that takes the Unicode
std's definition of equivalence (or collation, etc.) into account you
must call a certain library method.
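
Such a method could be a thin wrapper around normalization -- a sketch
only (unicode_equivalent() is a hypothetical helper, not an existing API):

    /* Equivalence-aware comparison as a library call rather than
       str.__eq__: normalize both sides to NFC, then compare. */
    static int                        /* 1 equal, 0 not equal, -1 error */
    unicode_equivalent(PyObject *a, PyObject *b)
    {
        PyObject *ud, *na, *nb;
        int result;
        if ((ud = PyImport_ImportModule("unicodedata")) == NULL)
            return -1;
        na = PyObject_CallMethod(ud, "normalize", "sO", "NFC", a);
        nb = PyObject_CallMethod(ud, "normalize", "sO", "NFC", b);
        Py_DECREF(ud);
        if (na == NULL || nb == NULL) {
            Py_XDECREF(na);
            Py_XDECREF(nb);
            return -1;
        }
        result = (PyUnicode_Compare(na, nb) == 0);
        Py_DECREF(na);
        Py_DECREF(nb);
        return result;
    }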

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Thu, Aug 25, 2011 at 2:39 AM, Stephen J. Turnbull  wrote:
> If our process is working with an external process (the OS's file
> system driver) whose definition includes the statement that "File
> names are sequences of Unicode characters",

Does any OS actually say that? Don't they usually say "in a specific
normal form" or "they're just bytes"?

> then C6 says our process
> must compare canonically equivalent sequences that it takes to be file
> names as the same, whether or not they are in the same normalized
> form, or normalized at all, because we can't assume the file system
> will treat them as different.  If we do treat them as different, our
> users will get very upset (eg, if we don't signal a duplicate file
> name input by the user, and then the OS proceeds to overwrite an
> existing file).

The solution here is to let the OS do the check, e.g. with
os.path.exists() or os.stat(). It would be wrong to write an app that
checked for file existence by doing naive lookups in os.listdir()
output.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing  wrote:
> What about things like the surrogateescape codec that
> deliberately use code units in non-standard ways? Will
> tricks like that still be possible if the code-unit
> level is hidden from the programmer?

I would think that it should still be possible to explicitly put
surrogates into a string, using the appropriate \u escape or
chr(i) or some such approach; the basic string operations IMO
shouldn't bother with checking for well-formed character sequences
(just as they shouldn't care about normal forms). But decoding bytes
from UTF-16 should not leave any surrogate pairs in, since
interpreting those is part of the decoding.
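
At the C level, that expectation looks like this sketch (byte values
assume UTF-16-LE; on a wide or PEP 393 build the result holds one code
point):

    /* U+12345 encoded as the UTF-16-LE surrogate pair D808 DF45. */
    int byteorder = -1;                       /* force little-endian */
    const char buf[] = "\x08\xd8\x45\xdf";
    PyObject *s = PyUnicode_DecodeUTF16(buf, 4, "strict", &byteorder);
    /* The decoder combines the pair, so s contains the single code
       point U+12345, not two surrogate code units. */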

I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair. Basically, I'm for a
garbage-in-garbage-out approach (with separate library functions to
detect garbage if the app is worried about it).

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Dino Viehland
Guido wrote:
> Which reminds me. The PEP does not say what other Python
> implementations besides CPython should do. Presumably Jython and
> IronPython will continue to use UTF-16, so presumably the language
> reference will still have to document that strings contain code units (not 
> code
> points) and the objections Tom Christiansen raised against this will remain
> true for those versions of Python. (I don't know about PyPy, they can
> presumably decide when they start their Py3k
> port.)
> 
> OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and
> we can lay the narrow build issues to rest? Can someone here speak for
> them?

The biggest difficulty for IronPython here would be dealing w/ .NET interop.
We can certainly introduce either an IronPython specific string class which
is similar to CPython's PyUnicodeObject or we could have multiple distinct
.NET types (IronPython.Runtime.AsciiString, System.String, and 
IronPython.Runtime.Ucs4String) which all appear as the same type to Python. 

But when Python is calling a .NET API it's always going to return a 
System.String 
which is UTF-16.  If we had to check and convert all of those strings when they 
cross into Python it would be very bad for performance.  Presumably we could
have a 4th type of "interop" string which lazily computes this but if we start
wrapping .Net strings we could also get into object identity issues.

We could stop using System.String in IronPython all together and say when 
working w/ .NET strings you get the .NET behavior and when working w/ Python 
strings you get the Python behavior.  I'm not sure how weird and confusing that 
would be but conversion from an Ipy string to a .NET string could remain cheap 
if 
both were UTF-16, and conversions from .NET strings to Ipy strings would only 
happen if the user did so explicitly.  

But it's a huge change - it'll almost certainly touch every single source
file in IronPython.  I would think we'd get 3.2 done first and then think
about what to
do here.



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Ezio Melotti
On Wed, Aug 24, 2011 at 11:37 PM, Terry Reedy  wrote:

> On 8/24/2011 1:45 PM, Victor Stinner wrote:
>
>> On 24/08/2011 02:46, Terry Reedy wrote:
>>
>
>  I don't think that using UTF-16 with surrogate pairs is really a big
>> problem. A lot of work has been done to hide this. For example,
>> repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
>> Ezio fixed recently str.is*() methods in Python 3.2+.
>>
>
> I greatly appreciate that he did. The * (lower,upper,title) methods
> apparently are not fixed yet as the corresponding new tests are currently
> skipped for narrow builds.


There are two reasons for this:
1) the str.is* methods get the string and return True/False, so it's enough
to iterate on the string, combine the surrogates, and check if the result
islower/upper/etc.
Methods like lower/upper/etc, afaiu, currently get only a copy of the
string, and modify that in place.  The current macros advance to the next
char during reading and writing, so it's not possible to use them to
read/write from/to the same string.  We could either change the macros to
not advance the pointer [0] (and do it manually in the other functions like
is*) or change the function to get the original string too.
2) I'm on vacation.

Best Regards,
Ezio Melotti

[0]: for lower/upper/title it should be possible to modify the string in
place, because these operations never convert a non-BMP char to a BMP one
(and vice versa), so if two surrogates are read, two surrogates will be
written after the transformation.  I'm not sure this will work with all the
methods though (e.g. str.translate).
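
A sketch of the non-advancing alternative mentioned above, indexing
explicitly so the same string can be read and written in place (the exact
macro spellings, as in the pep-393 branch, are an assumption here):

    /* In-place lowercasing: read and write the same buffer by index,
       so no pointer is advanced between the read and the write. */
    Py_ssize_t i, n = PyUnicode_GET_LENGTH(unicode);
    int kind = PyUnicode_KIND(unicode);
    void *data = PyUnicode_DATA(unicode);
    for (i = 0; i < n; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        PyUnicode_WRITE(kind, data, i, Py_UNICODE_TOLOWER(ch));
    }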


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 10:50 AM, "Martin v. Löwis"  wrote:
> Not with these words, though. As I recall, it's rather like (still
> with different words) "len() will stay O(1) forever, regardless of
> any perceived incorrectness of this choice".

And indexing/slicing will also be O(1).

> An attempt to change
> the builtins to introduce higher complexity for the sake of correctness
> is what he rejects. I think PEP 393 balances this well, keeping
> the O(1) operations in that complexity, while improving the cross-
> platform "correctness" of these functions.

+1, I am comfortable with the balance struck by the PEP.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
[Apologies for sending out a long stream of pointed responses, written
before I have fully digested this entire mega-thread. I don't have the
patience today to collect them all into a single mega-response.]

On Wed, Aug 24, 2011 at 10:45 AM, Victor Stinner
 wrote:
> Note: Java and the Qt library use also UTF-16 strings and have exactly the
> same "limitations" for str[n] and len(str).

Which reminds me. The PEP does not say what other Python
implementations besides CPython should do. Presumably Jython and
IronPython will continue to use UTF-16, so presumably the language
reference will still have to document that strings contain code units
(not code points) and the objections Tom Christiansen raised against
this will remain true for those versions of Python. (I don't know
about PyPy, they can presumably decide when they start their Py3k
port.)

OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach
and we can lay the narrow build issues to rest? Can someone here speak
for them?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy  wrote:
> Excuse me for believing the fine 3.2 manual that says
> "Strings contain Unicode characters." (And to a naive reader, that implies
> that string iteration and indexing should produce Unicode characters.)

The naive reader also doesn't know the difference between characters,
code points and code units. It's the advanced, Unicode-aware reader
who is confused by this phrase in the docs. It should say code units;
or perhaps code units for narrow builds and code points for wide
builds. With PEP 393 we can unconditionally say code points, which is
much better. We should try to remove our use of "characters" -- or
else we should *define* our use of the term "characters" as "what the
Unicode standard calls code points".

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull
 wrote:
> Well, no, it gives the right answer according to the design.  unicode
> objects do not contain character strings.  By design, they contain
> code point strings.  Guido has made that absolutely clear on a number
> of occasions.

Actually, the situation is that in narrow builds, they contain code
units (which may have surrogates); in wide builds they contain code
points. I think this is the crux of Tom Christiansen's complaints about
narrow builds.

Here's proof that narrow builds contain code units, not code points
(i.e. use UTF-16, not UCS-2):

$ ./python
Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> a = u'\U00012345'
>>> a
u'\U00012345'
>>> len(a)
2
>>>

It's pretty clear that the interpreter is surrogate-aware, which to me
indicates the use of UTF-16.

Now in the PEP 393 branch:

$ ./python
Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> a = '\U00012345'
>>> a
'𒍅'
>>> len(a)
1
>>>

And some proof that this branch does not care about surrogates:

>>> a = '\ud808'
>>> b = '\udf45'
>>> a
'\ud808'
>>> b
'\udf45'
>>> a + b
'\ud808\udf45'
>>> len(a+b)
2
>>>

However:

>>> a = '\ud808\udf45'
>>> a
'𒍅'
>>> len(a)
1
>>>

Which to me merely shows it is smart when parsing string literals.

(I expect that regular 3.3 narrow builds behave similar to the 2.7
narrow build, and 3.3 wide builds behave similar to the pep-393 build;
I didn't have those lying around.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Guido van Rossum
On Tue, Aug 23, 2011 at 7:41 PM, Torsten Becker
 wrote:
> On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou  wrote:
>> Macros are useful to shield the abstraction from the implementation. If
>> you access the members directly, and the unicode object is represented
>> differently in some future version of Python (say e.g. with tagged
>> pointers), your code doesn't compile anymore.
>
> I agree with Antoine, from the experience of porting C code from 3.2
> to the PEP 393 unicode API, the additional encapsulation by macros
> made it much easier to change the implementation of what is a field,
> what is a field's actual name, and what needs to be calculated through
> a function.
>
> So, I would like to keep primary access as a macro but I see the point
> that it would make the struct clearer to access and I would not mind
> changing the struct to use a union.  But then most access currently is
> through macros so I am not sure how much benefit the union would bring
> as it mostly complicates the struct definition.

+1

> Also, common, now simple, checks for "unicode->str == NULL" would look
> more ambiguous with a union ("unicode->str.latin1 == NULL").

You could add an extra union field for that:

unicode->str.voidptr == NULL

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Guido van Rossum
On Thu, Aug 25, 2011 at 1:24 AM, "Martin v. Löwis"  wrote:
>> With this PEP, the unicode object overhead grows to 10 pointer-sized
>> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
>> Does it have any adverse effects?
>
> If I count correctly, it's only three *additional* words (compared to
> 3.2): four new ones, minus one that is removed. In addition, it drops
> a memory block. Assuming a malloc overhead of two pointers per malloc
> block, we get one additional pointer.
[...]

But strings are allocated via PyObject_Malloc(), i.e. the custom
arena-based allocator -- isn't its overhead (for small objects) less
than 2 pointers per block?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Stefan Behnel

Stefan Behnel, 25.08.2011 20:47:

"Martin v. Löwis", 24.08.2011 20:15:

- issues to be considered (unclarities, bugs, limitations, ...)


A problem of the current implementation is the need for calling
PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to
insufficient memory). Basically, this means that even something as trivial
as trying to get the length of a Unicode string can now result in an error.


Oh, and the same applies to PyUnicode_AS_UNICODE() now. I doubt that there 
is *any* code out there that expects this macro to ever return NULL. This 
means that the current implementation has actually broken the old API. Just 
allocate an "80% of your memory" long string using the new API and then 
call PyUnicode_AS_UNICODE() on it to see what I mean.


Sadly, a quick look at a couple of recent commits in the pep-393 branch 
suggested that it is not even always obvious to you as the authors which 
macros can be called safely and which cannot. I immediately spotted a bug 
in one of the updated core functions (unicode_repr, IIRC) where 
PyUnicode_GET_LENGTH() is called without a previous call to 
PyUnicode_FAST_READY().


I find it anything but obvious that calling PyUnicode_DATA() and 
PyUnicode_KIND() is safe as long as the return value is being checked for 
errors, but calling PyUnicode_GET_LENGTH() is not safe unless there was a 
previous call to PyUnicode_Ready().




I just noticed this when rewriting Cython's helper function that searches a
unicode string for a (Py_UCS4) character. Previously, the entire function
was safe, could never produce an error and therefore always returned a
boolean result. In the new world, the caller of this function must check
and propagate errors. This may not be a major issue in most cases, but it
can have a non-trivial impact on user code, depending on how deep in a call
chain this happens and on how much control the user has over the call chain
(think of a C callback, for example).
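
To make that concrete, here is a sketch (not Cython's actual helper) of
what the new error path forces on the signature:

    /* Before: a plain boolean result.  Now: 1 found, 0 not found,
       -1 error (e.g. PyUnicode_READY() failing on insufficient memory). */
    static int
    unicode_contains_ucs4(PyObject *u, Py_UCS4 ch)   /* hypothetical */
    {
        Py_ssize_t i, n;
        int kind;
        void *data;
        if (PyUnicode_READY(u) == -1)
            return -1;
        n = PyUnicode_GET_LENGTH(u);
        kind = PyUnicode_KIND(u);
        data = PyUnicode_DATA(u);
        for (i = 0; i < n; i++)
            if (PyUnicode_READ(kind, data, i) == ch)
                return 1;
        return 0;
    }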

Also, even in the case that there is no error, the potential need to build
up the string on request means that the run time and memory requirements of
an algorithm are less predictable now as they depend on the origin of the
input and not just its Python level string content.

I would be happier with an implementation that avoided this by always
instantiating the data buffer right from the start, instead of carrying
only a Py_UNICODE buffer for old-style instances.


Stefan



Re: [Python-Dev] Sphinx version for Python 2.x docs

2011-08-25 Thread Łukasz Langa
On 23 Aug 2011, at 01:09, Sandro Tosi wrote:

> What I want to understand is if it's an acceptable change. I see
> sphinx more as an internal build tool, so freezing it is like saying
> "don't upgrade gcc" or so.

Normally I'd say it's natural for us to specify that for a legacy
release we're using build tools in versions up to so-and-so. Plus,
requiring changes in the repository additionally points out that this
is indeed touching "frozen" code.

In case of 2.7 though, it's our "LTS release", so I think if Georg
agrees, I'm also in favor of the upgrade.

As for Sphinx using svn.python.org, the main issue is not altering the
scripts to use Hg, it's the weight of the whole Sphinx repository that
would have to be cloned for each distclean. By using SVN you're only
downloading a specifically tagged source tree.

-- 
Best regards,
Łukasz Langa
Senior Systems Architecture Engineer
IT Infrastructure Department
Grupa Allegro Sp. z o.o.

Please consider the environment before printing out this e-mail.



Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Stefan Behnel

"Martin v. Löwis", 24.08.2011 20:15:

- issues to be considered (unclarities, bugs, limitations, ...)


A problem of the current implementation is the need for calling 
PyUnicode_(FAST_)READY(), and the fact that it can fail (e.g. due to 
insufficient memory). Basically, this means that even something as trivial 
as trying to get the length of a Unicode string can now result in an error.


I just noticed this when rewriting Cython's helper function that searches a 
unicode string for a (Py_UCS4) character. Previously, the entire function 
was safe, could never produce an error and therefore always returned a 
boolean result. In the new world, the caller of this function must check 
and propagate errors. This may not be a major issue in most cases, but it 
can have a non-trivial impact on user code, depending on how deep in a call 
chain this happens and on how much control the user has over the call chain 
(think of a C callback, for example).


Also, even in the case that there is no error, the potential need to build 
up the string on request means that the run time and memory requirements of 
an algorithm are less predictable now as they depend on the origin of the 
input and not just its Python level string content.


I would be happier with an implementation that avoided this by always 
instantiating the data buffer right from the start, instead of carrying 
only a Py_UNICODE buffer for old-style instances.


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Torsten Becker
Okay, I am convinced. :)   If Martin does not object, I would change
the "void *str" field to

    union {
        void *any;
        unsigned char *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
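
With that union, the NULL check mentioned earlier would read like this
(a sketch, assuming the field names above):

    /* The type-neutral member keeps the common emptiness test simple. */
    if (unicode->data.any == NULL) {
        /* character data not allocated yet */
    }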


Regards,
Torsten


On Wed, Aug 24, 2011 at 02:57, Stefan Behnel  wrote:
> Torsten Becker, 24.08.2011 04:41:
>>
>> Also, common, now simple, checks for "unicode->str == NULL" would look
>> more ambiguous with a union ("unicode->str.latin1 == NULL").
>
> You could just add yet another field "any", i.e.
>
>    union {
>       unsigned char* latin1;
>       Py_UCS2* ucs2;
>       Py_UCS4* ucs4;
>       void* any;
>    } str;
>
> That way, the above test becomes
>
>    if (!unicode->str.any)
>
> or
>
>    if (unicode->str.any == NULL)
>
> Or maybe even call it "initialised" to match the intended purpose:
>
>    if (!unicode->str.initialised)
>
> That being said, I don't mind "unicode->str.latin1 == NULL" either, given
> that it will (as mentioned by others) be hidden behind a macro most of the
> time anyway.
>
> Stefan


[Python-Dev] DNS problem with ar.pycon.org

2011-08-25 Thread Facundo Batista
Sorry for the crossposting, but I don't know who admins the pycon.org site.

It seems that something happened to "ar.pycon.org", it should point to
the same IP than "pycon.python.org.ar" (190.228.30.157).

Somebody knows who can fix it?

BTW, how do I update that page? We're having the third PyCon in
Argentina this year...

Thank you!

-- 

.    Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/


Re: [Python-Dev] [Python-checkins] devguide: #12792: document the "type" field of the tracker.

2011-08-25 Thread Nick Coghlan
On Thu, Aug 25, 2011 at 9:59 PM, Nick Coghlan  wrote:
> A link to http://www.python.org/news/security/ would be handy here,
> since that has the GPG key to send encrypted messages to the security
> list.

http://www.python.org/security/ is a better variant of the link,
though (it redirects to the security advisory page, but looks nicer)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > Am 25.08.2011 11:39, schrieb Stephen J. Turnbull:
 > > "Martin v. Löwis" writes:
 > > 
 > >  > No, that's explicitly *not* what C6 says. Instead, it says that a
 > >  > process that treats s1 and s2 differently shall not assume that others
 > >  > will do the same, i.e. that it is ok to treat them the same even though
 > >  > they have different code points. Treating them differently is also
 > >  > conforming.
 > > 
 > > Then what requirement does C6 impose, in your opinion? 
 > 
 > In IETF terminology, it's a weak SHOULD requirement. Unless there are
 > reasons not to, equivalent strings should be treated equivalently. It's
 > a weak requirement because the reasons not to treat them as equivalent are
 > wide-spread.

There are no "weak SHOULDs" and no "wide-spread reasons" in RFC 2119.
RFC 2119 specifies "particular circumstances" and "full implications"
that are "carefully weighed" before varying from SHOULD behavior.

IMHO the Unicode Standard intends a full RFC 2119 "SHOULD" here.

 > Yes, but that's the operating system's choice first of all.  Some
 > operating systems do allow file names in a single directory that
 > are equivalent yet use different code points. Python then needs to
 > support this operating system, despite the permission of the
 > Unicode standard to ignore the difference.

Sure, and that's one of several such reasons why I think the PEP's
implementation of unicodes as arrays of code points is an optimal
balance.  But the Unicode standard does not "permit" ignoring the
difference here, except in the sense that *the Unicode standard
doesn't apply at all* and therefore doesn't forbid it.  The OSes in
question are not conforming processes, and presumably don't claim to
be.

Because most of the processes Python interacts with won't be
conforming processes (not even the majority of textual applications,
for a while), Python does not need to be, and *should not* be, a
conforming Unicode process for most of what it does.  Not even for
much of its text processing.

Also, to the extent that Python is a general-purpose language, I see
nothing wrong and lots of good in having a non-conformant code point
array type as the platform for implementing conforming Unicode
library(ies).

But this is not user/developer-friendly at all:

 > Wrt. normalization, I think all that's needed is already there.
 > Applications just need to normalize all strings to a normal form of
 > their liking, and be done. That's easier than using a separate library
 > throughout the code base (let alone using yet another string type).

But many users have never heard of normalization.  And that's *just*
normalization.  There is a whole raft of other requirements for
conformance (collation, case, etc).

The point is that with such a library and string type, various aspects
of conformance to Unicode, as well as conformance to associated
standards (eg, the dreaded UTS #18 ;-) can be added to the library
over time, and most users (those who don't need to squeeze every ounce
of performance out of Python) can be blissfully unaware of what, if
anything, they're conforming to.  Just upgrade the library to get the
best Unicode support (in terms of conformance) that Python has to
offer.

But for the reasons you (and Guido and Nick and ...) give, it's not
reasonable to put all that into core Python, not anytime soon.  Not to
mention that as a work-in-progress, it can hardly be considered stable
enough for the stdlib.

That is what Terry Reedy is getting at, AIUI.  "Batteries included"
should mean as much Unicode conformance as we can reasonably provide
should be *conveniently* available.  The ideal (given the caveat about
efficiency) would be *one* import statement and a ConformingUnicode type
that acts "just like a string" in all ways, except that (1) it indexes
and counts on characters (preferably "grapheme clusters" :-), (2) does
collation, regexps, and the like conformant to the Unicode standard,
and (3) may be quite inefficient from the point of view of bit-
shoveling net applications and the like.

Of course most of (2) is going to take quite a while, but (1) and (3)
should not be that hard to accomplish (especially (3) ;-).

 > > I'm simply saying that the current implementation of strings, as
 > > improved by PEP 393, can not be said to be conforming.
 > 
 > I continue to disagree. The Unicode standard deliberately allows
 > Python's behavior as conforming.

That's up to you.  I doubt very many users or application developers
will see it that way, though.  I think they would prefer that we be
conservative about what we call "conformant", and tell them precisely
what they need to do to get what they consider conformant behavior
from Python.  That's easier if we share definitions of conformant with
them.  And surely there would be great joy on the battlements if there
were a one-import way to spell "all the Unicode conformance you can
give me, please".

The problem with your legalistic approach, as I see it, is that if our
definition is looser than the users', all their surprises will be
unpleasant.  That's not good.

Re: [Python-Dev] [Python-checkins] devguide: #12792: document the "type" field of the tracker.

2011-08-25 Thread Nick Coghlan
On Tue, Aug 23, 2011 at 7:46 AM, ezio.melotti
 wrote:
> +security
> +    Issues that might have security implications.  If you think the issue
> +    should not be made public, please report it to secur...@python.org 
> instead.

A link to http://www.python.org/news/security/ would be handy here,
since that has the GPG key to send encrypted messages to the security
list.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Nick Coghlan
On Thu, Aug 25, 2011 at 7:57 PM, "Martin v. Löwis"  wrote:
> Am 25.08.2011 11:39, schrieb Stephen J. Turnbull:
>> I'm simply saying that the current
>> implementation of strings, as improved by PEP 393, can not be said to
>> be conforming.
>
> I continue to disagree. The Unicode standard deliberately allows
> Python's behavior as conforming.

I'd actually put it slightly differently: it seems to me that Python,
in and of itself, can neither conform to nor violate that part of the
standard, since conformance depends on how the *application* processes
the data.

However, we can make it harder or easier for applications to be
conformant. UCS2 builds make it harder, since some code points have to
be represented as code units internally. UCS4 builds and future PEP
393 builds (which should exhibit current UCS4 build semantics at the
Python layer) make it easier, since the internal representation
consistently uses code points, with code units only appearing as part
of the encoding and decoding process.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Antoine Pitrou

Hello,

On Thu, 25 Aug 2011 10:24:39 +0200
"Martin v. Löwis"  wrote:
> 
> On a 32-bit machine with a 32-bit wchar_t, pure-ASCII strings of length
> 1 (+NUL) will take the same memory either way: 8 bytes for the
> characters in 3.2, 2 bytes in 3.3 + extra pointer + padding. Strings
> of 2 or more characters will take more space in 3.2.
> 
> On a 32-bit machine with a 16-bit wchar_t, pure-ASCII strings up
> to 3 characters take the same space either way; space savings start at
> four characters.
> 
> On a 64-bit machine with a 16-bit wchar_t, assuming a malloc minimum
> block size of 16 bytes, pure-ASCII strings of up to 7 characters take
> the same space. For 8 characters, 3.2 will need 32 bytes for the
> characters, whereas 3.3 will only take 16 bytes (due to padding).

That's very good. For future reference, could you add this information
to the PEP?

> >> - conditions you would like to pose on the implementation before
> >>   acceptance. I'll see which of these can be resolved, and list
> >>   the ones that remain open.
> > 
> > That it doesn't significantly slow down benchmarks such as stringbench
> > and iobench.
> 
> Can you please quantify "significantly"? Also, having a complete list
> of benchmarks to perform prior to acceptance would be helpful.

I would say no more than a 15% slowdown on each of the following
benchmarks:

- stringbench.py -u
  (http://svn.python.org/view/sandbox/trunk/stringbench/)
- iobench.py -t
  (in Tools/iobench/)
- the json_dump, json_load and regex_v8 tests from
  http://hg.python.org/benchmarks/

I believe these are representative of string-heavy operations.

Additionally, it would be nice if you could run at least some of the
test_bigmem tests, according to your system's available RAM.

Regards

Antoine.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Martin v. Löwis
Am 25.08.2011 11:39, schrieb Stephen J. Turnbull:
> "Martin v. Löwis" writes:
> 
>  > No, that's explicitly *not* what C6 says. Instead, it says that a
>  > process that treats s1 and s2 differently shall not assume that others
>  > will do the same, i.e. that it is ok to treat them the same even though
>  > they have different code points. Treating them differently is also
>  > conforming.
> 
> Then what requirement does C6 impose, in your opinion? 

In IETF terminology, it's a weak SHOULD requirement: unless there are
reasons not to, equivalent strings should be treated identically. It's
a weak requirement because reasons to treat them differently are
widespread.

> - Ideally, an implementation would *always* interpret two
>   canonical-equivalent sequences *identically*.  There are practical
>   circumstances under which implementations may reasonably distinguish
>   them.  (Emphasis mine.)

Ok, so let me put emphasis on *ideally*. They acknowledge that for
practical reasons, the equivalent strings may need to be
distinguished.

> The examples given are things like "inspecting memory representation
> structure" (which properly speaking is really outside of Unicode
> conformance) and "ignoring collation behavior of combining sequences
> outside the repertoire of a specified language."  That sounds like
> "Special cases aren't special enough to break the rules. Although
> practicality beats purity." to me.  Treating things differently is an
> exceptional case, that requires sufficient justification.

And the common justification is efficiency, along with the desire
to support the representation of unnormalized strings (if unnormalized
strings didn't need to be representable, normalizing everything on
creation would allow an efficient implementation).

> If our process is working with an external process (the OS's file
> system driver) whose definition includes the statement that "File
> names are sequences of Unicode characters", then C6 says our process
> must compare canonically equivalent sequences that it takes to be file
> names as the same, whether or not they are in the same normalized
> form, or normalized at all, because we can't assume the file system
> will treat them as different.

It may well happen that this requirement is met in a plain Python
application. If the file system and GUI libraries always return
NFD strings, then the Python process *will* compare equivalent
sequences correctly (since it won't ever get any other
representations).

> *Users* will certainly take the viewpoint that two strings that
> display the same on their monitor should identify the same file when
> they use them as file names.

Yes, but that's the operating system's choice first of all.
Some operating systems do allow file names in a single directory
that are equivalent yet use different code points. Python then
needs to support this operating system, despite the permission of the
Unicode standard to ignore the difference.
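
At the Python level, two such equivalent names look like this (a small
sketch with made-up strings; both spellings display identically):

    import unicodedata

    composed = 'caf\u00e9'     # LATIN SMALL LETTER E WITH ACUTE
    decomposed = 'cafe\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

    composed == decomposed     # False: the code points differ
    unicodedata.normalize('NFD', composed) == decomposed   # True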

> I'm simply saying that the current
> implementation of strings, as improved by PEP 393, can not be said to
> be conforming.

I continue to disagree. The Unicode standard deliberately allows
Python's behavior as conforming.

> I would like to see something much more conformant done as a separate
> library (the Python Components for Unicode, say), intended to support
> users who need character-based behavior, Unicode-ly correct collation,
> etc., more than efficiency.

Wrt. normalization, I think all that's needed is already there.
Applications just need to normalize all strings to a normal form of
their liking, and be done. That's easier than using a separate library
throughout the code base (let alone using yet another string type).
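
For instance (a sketch; the helper name is invented):

    import unicodedata

    def normalized(s, form='NFC'):
        # Run every string through this as it enters the application;
        # plain == is then safe between the results.
        return unicodedata.normalize(form, s)

    normalized('\u212b') == normalized('A\u030a')   # True: ANGSTROM SIGN
                                                    # and A + COMBINING
                                                    # RING ABOVE both
                                                    # become U+00C5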

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > No, that's explicitly *not* what C6 says. Instead, it says that a
 > process that treats s1 and s2 differently shall not assume that others
 > will do the same, i.e. that it is ok to treat them the same even though
 > they have different code points. Treating them differently is also
 > conforming.

Then what requirement does C6 impose, in your opinion?  It sounds like
you don't think it imposes any, in practice.

Note that in the discussion of C6, the standard says,

- Ideally, an implementation would *always* interpret two
  canonical-equivalent sequences *identically*.  There are practical
  circumstances under which implementations may reasonably distinguish
  them.  (Emphasis mine.)

The examples given are things like "inspecting memory representation
structure" (which properly speaking is really outside of Unicode
conformance) and "ignoring collation behavior of combining sequences
outside the repertoire of a specified language."  That sounds like
"Special cases aren't special enough to break the rules. Although
practicality beats purity." to me.  Treating things differently is an
exceptional case, that requires sufficient justification.

My understanding is that if those strings are exchanged with another
process, then whether or not treating them differently is allowed
depends on whether the results will be output to another process, and
what the definition of our process is.  Sometimes it will be allowed,
but mostly it won't.  Take file names as an example.

If our process is working with an external process (the OS's file
system driver) whose definition includes the statement that "File
names are sequences of Unicode characters", then C6 says our process
must compare canonically equivalent sequences that it takes to be file
names as the same, whether or not they are in the same normalized
form, or normalized at all, because we can't assume the file system
will treat them as different.  If we do treat them as different, our
users will get very upset (eg, if we don't signal a duplicate file
name input by the user, and then the OS proceeds to overwrite an
existing file).

Dually, having made the statement that file names are Unicode, C6 says
that the OS driver must return the same file given two canonically
equivalent strings that happen to have different code points in them,
because it may not assume that *we* will treat those strings as
different names of different files.

*Users* will certainly take the viewpoint that two strings that
display the same on their monitor should identify the same file when
they use them as file names.

Now, I'm *not* saying that Python's strings *should* conform to the
Unicode standard in this respect yet (or ever, for that matter; I'm
with Guido on that).  I'm simply saying that the current
implementation of strings, as improved by PEP 393, can not be said to
be conforming.

I would like to see something much more conformant done as a separate
library (the Python Components for Unicode, say), intended to support
users who need character-based behavior, Unicode-ly correct collation,
etc., more than efficiency.  Applications that need both will have to
make their own way at first, either by contributing improvements to
the library or by using application-specific algorithms.



Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Victor Stinner

On 25/08/2011 06:46, Stefan Behnel wrote:
>> Conversion to wchar_t* is common, especially on Windows.
>
> That's an issue. However, I cannot say how common this really is in
> practice. Surely depends on the specific code, right? How common is it
> in core CPython?

Nearly all functions taking text as an argument on Windows expect
wchar_t* strings (UTF-16). In Python, we pass a "Py_UNICODE*"
(PyUnicode_AS_UNICODE or PyUnicode_AsUnicode) because Py_UNICODE is
wchar_t on Windows.


Victor


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Victor Stinner

On 25/08/2011 06:12, Stephen J. Turnbull wrote:
>> Let's take small steps. Do the evolutionary thing. Let's get things
>> right so users won't have to worry about code points vs. code units
>> any more. A conforming library for all things at the character level
>> can be developed later, once we understand things better at that level
>> (again, most developers don't even understand most of the subtleties,
>> so I claim we're not ready).
>
> I don't think anybody does.  That's one reason there's a new version
> of Unicode every few years.


It took some weeks (months?) to write the PEP, and months to implement
it. This PEP is only a minor change to the implementation of Unicode in
Python. A larger change will take much more time (and may change or
break the C and/or Python API a little more).


If you are able to implement your specification (a Unicode type with a
"real" character API), please write a PEP and implement it. You may
begin with a prototype in Python, and then rewrite it in C.


But I don't think that any core developer will do that for you. That's
not how free software works. At least, I don't think that anyone will
do it for free :-) (I bet that many developers would agree to implement
it for money :-))


Victor


Re: [Python-Dev] PEP 393 review

2011-08-25 Thread Martin v. Löwis
> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?

If I count correctly, it's only three *additional* words (compared to
3.2): four new ones, minus one that is removed. In addition, it drops
a memory block. Assuming a malloc overhead of two pointers per malloc
block, we get one additional pointer.

On a 32-bit machine with a 32-bit wchar_t, pure-ASCII strings of length
1 (+NUL) will take the same memory either way: 8 bytes for the
characters in 3.2, 2 bytes in 3.3 + extra pointer + padding. Strings
of 2 or more characters will take more space in 3.2.

On a 32-bit machine with a 16-bit wchar_t, pure-ASCII strings up
to 3 characters take the same space either way; space savings start at
four characters.

On a 64-bit machine with a 16-bit wchar_t, assuming a malloc minimum
block size of 16 bytes, pure-ASCII strings of up to 7 characters take
the same space. For 8 characters, 3.2 will need 32 bytes for the
characters, whereas 3.3 will only take 16 bytes (due to padding).

So: no, I can't see any adverse effects. Details depend on the
malloc implementation, though. A slight memory increase compared to
a narrow build may occur for strings that use non-Latin-1 characters,
and a large increase for strings that use non-BMP characters.

The real issue of memory consumption is the alternative representations,
if created. That applies for the default encoding in 3.2 as well as
the wchar_t and UTF-8 representations in 3.3.

> Are there any plans to make instantiation of small strings fast enough?
> Or is it already as fast as it should be?

I don't have any plans, and I don't see potential. Compared to 3.2, it
saves a malloc call, which may be quite an improvement. OTOH, it needs
to iterate over the characters twice, to find the largest character.

If you are referring to the reuse of Unicode objects: that's currently
not done, and is difficult to do in the 3.2 way due to the various sizes
of characters. One idea might be to only reuse UCS1 strings, and then
keep a freelist for these based on the string length.

> When interfacing with the Win32 "wide" APIs, what is the recommended
> way to get the required LPCWSTR?

As before: PyUnicode_AsUnicode.

> Will the format codes returning a Py_UNICODE pointer with
> PyArg_ParseTuple be deprecated?

Not for 3.3, no.

> Do you think the wstr representation could be removed in some future
> version of Python?

Yes. This probably has to wait for Python 4, though.

> Is PyUnicode_Ready() necessary for all unicode objects, or only those
> allocated through the legacy API?

Only for the latter (although it doesn't hurt to apply it to all
of them).

> “The Py_Unicode representation is not instantaneously available”: you
> mean the Py_UNICODE representation?

Thanks, fixed.

>> - conditions you would like to pose on the implementation before
>>   acceptance. I'll see which of these can be resolved, and list
>>   the ones that remain open.
> 
> That it doesn't significantly slow down benchmarks such as stringbench
> and iobench.

Can you please quantify "significantly"? Also, having a complete list
of benchmarks to perform prior to acceptance would be helpful.

Thanks,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Martin v. Löwis
> What about things like the surrogateescape codec that
> deliberately use code units in non-standard ways? Will
> tricks like that still be possible if the code-unit
> level is hidden from the programmer?

Most certainly. In the PEP-393 representation, the surrogate
characters can readily be represented (and would imply at least
the two-byte form), but they will never take their UTF-16
function (i.e. the UTF-8 codec won't try to combine surrogate
pairs), so they can be used for surrogateescape and other
functions. Of course, in strict error mode, codecs will
refuse to encode them (notice that surrogateescape is an error
handler, not a codec).
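
A short demonstration (the byte string is arbitrary):

    raw = b'abc\xff'                            # not valid UTF-8

    s = raw.decode('utf-8', 'surrogateescape')  # 'abc\udcff': the bad
                                                # byte is smuggled in as
                                                # a lone surrogate
    s.encode('utf-8', 'surrogateescape')        # round-trips to b'abc\xff'
    s.encode('utf-8')                           # strict: UnicodeEncodeError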

Regards,
Martin



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Martin v. Löwis
>  > What is non-conforming about comparing two code points?
> 
> Unicode conformance means treating characters correctly.

Re-read the text. You are interpreting something that isn't there.


>  > Seriously, what does Unicode-conforming mean here?
> 
> Chapter 3, all verses.  Here, specifically C6, p. 60.  One would have
> to define the process executing "s1[0] == s2[0]" to be sure that even
> in the cases cited in the previous paragraph non-conformance is
> occurring

No, that's explicitly *not* what C6 says. Instead, it says that a
process that treats s1 and s2 differently shall not assume that others
will do the same, i.e. that it is ok to treat them the same even though
they have different code points. Treating them differently is also
conforming.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Martin v. Löwis
>>Strings contain Unicode code units, which for most purposes can be
>>treated as Unicode characters.  However, even as "simple" an
>>operation as "s1[0] == s2[0]" cannot be relied upon to give
>>Unicode-conforming results.
>>
>> The second sentence remains true under PEP 393.
> 
> Really? If strings contain code units, that expression compares code
> units. What is non-conforming about comparing two code points? They
> are just integers.
> 
> Seriously, what does Unicode-conforming mean here?

I think he's referring to combining characters and normal forms. 2.12
starts with

"In cases involving two or more sequences considered to be equivalent,
the Unicode Standard does not prescribe one particular sequence as being
the  correct one; instead, each  sequence is merely equivalent to the
others"

That could be read to imply that the == operator should determine
whether two strings are equivalent. However, the Unicode standard
clearly leaves API design to the programming environment, and has
the notion of conformance only for processes. So saying that Python
is or is not unicode-conforming is, strictly speaking, meaningless.

The closest conformance requirement in that respect is C6

"A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct."

However, that explicitly does *not* support the conformance statement
that Stephen made. They elaborate

"Ideally, an implementation would always interpret two
canonical-equivalent  character sequences identically. There are
practical circumstances under which  implementations may reasonably
distinguish them."

So practicality beats purity even in Unicode conformance: the
== operator of Python can reasonably treat equivalent strings
as unequal (and there is a good reason for that, indeed). Processes
should not expect that other applications make the same distinction,
so they need to cope if it matters to them. There are different ways
to do that (the second is sketched below):
- normalize all strings on input, and then use ==
- use a different comparison operation that always normalizes
  its input first
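
A minimal sketch of that second option (the function name is invented):

    import unicodedata

    def canonical_equal(s1, s2, form='NFC'):
        # Normalize both operands to the same form before comparing.
        return (unicodedata.normalize(form, s1) ==
                unicodedata.normalize(form, s2))

    'caf\u00e9' == 'cafe\u0301'                 # False: == compares
                                                # code points
    canonical_equal('caf\u00e9', 'cafe\u0301')  # True: canonically
                                                # equivalent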

> This I agree with (though if you were referring to me with
> "leadership" I consider myself woefully underinformed about Unicode
> subtleties). I also suspect that Unicode "conformance" (however
> defined) is more part of a political battle than an actual necessity.

Fortunately, it's much better than that. Unicode has had very clear
conformance requirements for a long time, and they aren't hard
to meet.

Wrt. C6, Python could certainly improve, e.g. by caching whether
a string had been determined to be in normal form, so that applications
can more reasonably apply normalization to all strings they ever
want to compare.
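
Applications can approximate that caching today at their own level (a
rough sketch, not the interpreter-level support meant above):

    import unicodedata

    _nfc_cache = {}

    def nfc(s):
        # Memoize normalization results so strings that are compared
        # repeatedly are only normalized once.
        try:
            return _nfc_cache[s]
        except KeyError:
            result = _nfc_cache[s] = unicodedata.normalize('NFC', s)
            return result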

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-25 Thread Greg Ewing

On 25/08/11 14:29, Guido van Rossum wrote:
> Let's get things
> right so users won't have to worry about code points vs. code units
> any more.

What about things like the surrogateescape codec that
deliberately use code units in non-standard ways? Will
tricks like that still be possible if the code-unit
level is hidden from the programmer?

--
Greg


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-25 Thread John O'Connor
+1 FileSystemError, for the reasons already stated.

- John