Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Terry Reedy

On 8/26/2011 9:56 PM, Antoine Pitrou wrote:


Another "interesting" question is whether it's easy to port to the PEP
393 string representation, if it gets accepted.


Will the re module need porting also?

--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Martin v. Löwis
> I can't think of any either, but ISTR hearing that from __future__ import
> was started with such an intent. 

No, not at all. The original intention was to enable features that would
definitely be added, just not right away. Tim Peters always objected to
claims that future imports were about provisional features.

> Politically, and from a marketing standpoint, it's easier to withdraw
> a feature you've given with a "Play with this, see if it works for
> you" warning.

We don't want to add features to Python that we may have to withdraw.
If there is doubt whether they should be added, they shouldn't be added.
If they do get added, we have to live with it (until, say, Python 4,
where bad features can be removed again).

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Terry Reedy

On 8/26/2011 8:23 PM, Antoine Pitrou wrote:


I would only agree as long as it wasn't too much worse
than O(1). O(log n) might be all right, but O(n) would be
unacceptable, I think.


It also depends a lot on *actual* measured performance


Amen. Some regard O(n*n) sorts as, by definition, 'worse' than 
O(n*log n). I even read that in an otherwise good book by a university 
professor. Fortunately for Python users, Tim Peters ignored that 
'wisdom', coded the best O(n*n) sort he could, and then *measured* to 
find out what was better for what types and lengths of arrays. So now we 
have a list.sort that sometimes beats the pure O(n log n) quicksort of C 
libraries.
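The point that measurement beats asymptotic labels is easy to see today: list.sort (timsort) is adaptive, so nearly-sorted input sorts far faster than shuffled input of the same size. An illustrative timing sketch (absolute numbers will vary by machine):

```python
import random
import timeit

n = 10_000
sorted_data = list(range(n))
shuffled_data = sorted_data[:]
random.shuffle(shuffled_data)

# sorted(...) copies its argument, so every call sorts fresh input
t_sorted = min(timeit.repeat(lambda: sorted(sorted_data), number=100, repeat=3))
t_shuffled = min(timeit.repeat(lambda: sorted(shuffled_data), number=100, repeat=3))

print(f"already sorted: {t_sorted:.4f}s  shuffled: {t_shuffled:.4f}s")
# timsort detects the existing run and finishes in ~O(n) on sorted input
```

On already-sorted input timsort does one linear scan, so the first timing is reliably an order of magnitude below the second.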


--
Terry Jan Reedy



Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Martin v. Löwis
> I'm not sure it's worth doing an extensive review of the code; a better
> approach might be to require extensive test coverage (and a review of
> the tests).

I think it's worth it. It's really bad if only one developer fully
understands the regex implementation.

Regards,
Martin



Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Dan Stromberg
On Fri, Aug 26, 2011 at 8:47 PM, Steven D'Aprano wrote:

> Antoine Pitrou wrote:
>
>> On Fri, 26 Aug 2011 17:25:56 -0700
>> Dan Stromberg  wrote:
>>
>>> If you add regex as "import regex", and the new regex module doesn't work
>>> out, regex might be harder to get rid of.  from __future__ import is an
>>> established way of trying something for a while to see if it's going to
>>> work.
>>>
>>
>> That's an interesting idea. This way, integrating the new module would
>> be a less risky move, since if it gives us too many problems, we could
>> back out our decision in the next feature release.
>>
>
> I'm not sure that's correct. If there are differences in either the
> interface or the behaviour between the new regex and re, then reverting will
> be a pain regardless of whether you have:
>
> from __future__ import re
> re.compile(...)
>
> or
>
> import regex
> regex.compile(...)
>
>
> Either way, if the new regex library goes away, code will break, and fixing
> it may not be easy.


You're talking technically, which is important, but wasn't what I was
suggesting would be helped.

Politically, and from a marketing standpoint, it's easier to withdraw a
feature you've given with a "Play with this, see if it works for you"
warning.

> Have there been any __future__ features that were added provisionally?

I can't think of any either, but ISTR hearing that from __future__ import
was started with such an intent.  Regardless, it's hard to import something
from "future" without at least suspecting that you're on the bleeding edge.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Raymond Hettinger

On Aug 26, 2011, at 8:51 PM, Terry Reedy wrote:

> 
> 
> On 8/26/2011 8:42 PM, Guido van Rossum wrote:
>> On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy  wrote:
> 
>>> My impression is that a UTF-16 implementation, to be properly called such,
>>> must do len and [] in terms of code points, which is why Python's narrow
>>> builds are called UCS-2 and not UTF-16.
>> 
>> I don't think anyone else has that impression. Please cite chapter and
>> verse if you really think this is important. IIUC, UCS-2 does not
>> allow surrogate pairs, whereas Python (and Java, and .NET, and
>> Windows) 16-bit strings all do support surrogate pairs. And they all
> 
> For that reason, I think UTF-16 is a better term than UCS-2 for narrow builds 
> (whether or not the above impression is true).

I agree.  It's weird to call something UCS-2 if code points above 65535 are 
representable.
The naming convention for codecs is that the UTF prefix is used for lossless 
encodings that cover the entire range of Unicode.

"The first amendment to the original edition of the UCS defined UTF-16, an 
extension of UCS-2, to represent code points outside the BMP."
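The distinction is easy to demonstrate from Python itself. On a wide (or PEP 393) build, a non-BMP character is one code point, but its UTF-16 form is two code units — exactly the surrogate pair a 2011 narrow build exposed through len and indexing. A small illustration:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane
s = "\U0001D11E"

utf16 = s.encode("utf-16-be")   # big-endian, no BOM
code_units = len(utf16) // 2    # each UTF-16 code unit is 2 bytes

assert len(s) == 1                # one code point (wide/PEP 393 build)
assert code_units == 2            # ...but two UTF-16 code units
assert utf16.hex() == "d834dd1e"  # the high/low surrogate pair
```

On a 2011 narrow build, `len(s)` here returned 2 — the code-unit count, which is the behaviour being debated.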

Raymond



Re: [Python-Dev] Sphinx version for Python 2.x docs

2011-08-26 Thread Georg Brandl
Am 23.08.2011 01:09, schrieb Sandro Tosi:
> Hi all,
> 
>> Any chance the version of sphinx used to generate the docs on
>> docs.python.org could be updated?
> 
> I'd like to discuss this aspect, in particular for the implication it
> has on http://bugs.python.org/issue12409 .
> 
> Personally, I do think it has value to have the same set of tools to
> build the Python documentation of all the currently active branches.
> Currently, only 2.7 is different, since it still fetches Sphinx 0.6.7
> (from svn.python.org... can we fix this too? suggestions welcome!)
> while 3.2/3.3 use 1.0.7.
> 
> If you're worried about the time needed to convert the current 2.7 docs
> to the new Sphinx format and all the related changes, I volunteer to do
> the job (and/or collaborate with whoever is already on it), but I want
> to understand whether it's an acceptable change.
> 
> I see Sphinx more as an internal build tool, so freezing it is like
> saying "don't upgrade gcc". Right now the delta is just the C function
> definitions and some py-specific roles, but over the years it will
> grow. Keeping the delta small, simplifying the forward-porting of doc
> patches (e.g. not needing two versions of a patch between 2.7 and 3.x),
> and having a common set of tools for doc building is worthwhile IMHO.
> 
> What do you think about it? and yes Georg, I'd like to hear your opinion too 
> :)

One of the main reasons for keeping Sphinx compatibility to 0.6.x was to
enable distributions (like Debian) to build the docs for the Python they ship
with the version of Sphinx that they ship.

This should now be fine with 1.0.x, so since you are ready to do the work of
converting the 2.7 Doc sources, it will be accepted.  The argument of easier
backports is a very good one.

The issue of using svn to download the tools is orthogonal; for this I would
agree to just packaging up a tarball or zipfile that is then downloaded using a
small Python script (should be properly cross-platform then).  Cloning the
original repositories is a) not useful, b) depends on availability of at least
two additional servers (remember docutils) and c) requires hg and svn.
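A sketch of the kind of small, cross-platform fetch script described here — the URL and destination path are hypothetical placeholders, and only the stdlib is used:

```python
import io
import tarfile
import urllib.request


def fetch(url: str) -> bytes:
    """Download the packaged-up tool tarball as raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def unpack(data: bytes, dest: str) -> None:
    """Extract a gzipped tarball (held in memory) into dest."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        tf.extractall(dest)


# Usage (hypothetical URL):
#   unpack(fetch("https://www.python.org/ftp/doctools/sphinx.tar.gz"), "Doc/tools")
```

Because tarfile and urllib are in the stdlib, this avoids the hg/svn dependency and the extra servers mentioned above.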

Georg



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Terry Reedy



On 8/26/2011 8:42 PM, Guido van Rossum wrote:

On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy  wrote:



My impression is that a UTF-16 implementation, to be properly called such,
must do len and [] in terms of code points, which is why Python's narrow
builds are called UCS-2 and not UTF-16.


I don't think anyone else has that impression. Please cite chapter and
verse if you really think this is important. IIUC, UCS-2 does not
allow surrogate pairs, whereas Python (and Java, and .NET, and
Windows) 16-bit strings all do support surrogate pairs. And they all


For that reason, I think UTF-16 is a better term than UCS-2 for narrow 
builds (whether or not the above impression is true).

But Marc Lemburg disagrees.
http://mail.python.org/pipermail/python-dev/2010-November/105751.html
The 2.7 docs still refer to ucs2 builds, as is his wish.

---
Terry Jan Reedy


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Steven D'Aprano

Antoine Pitrou wrote:

On Fri, 26 Aug 2011 17:25:56 -0700
Dan Stromberg  wrote:

[...]

If you add regex as "import regex", and the new regex module doesn't work
out, regex might be harder to get rid of.  from __future__ import is an
established way of trying something for a while to see if it's going to
work.


That's an interesting idea. This way, integrating the new module would
be a less risky move, since if it gives us too many problems, we could
back out our decision in the next feature release.


I'm not sure that's correct. If there are differences in either the 
interface or the behaviour between the new regex and re, then reverting 
will be a pain regardless of whether you have:


from __future__ import re
re.compile(...)

or

import regex
regex.compile(...)


Either way, if the new regex library goes away, code will break, and 
fixing it may not be easy. It's not likely to be so easy that merely 
deleting the "from __future__ ..." line will do it, but if it is that 
easy, then using "import re as regex" will be just as easy.


Have there been any __future__ features that were added provisionally? I 
can't think of any. That's not what __future__ is for, at least 
according to PEP 236.


http://www.python.org/dev/peps/pep-0236/
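PEP 236's intent is visible in the __future__ module itself: every feature records the release where it became optional and the release where it becomes the default — there is no "provisional, may be withdrawn" state. A quick look (division as the example):

```python
import __future__

# Each feature is a _Feature with two release tuples and a compiler flag
feat = __future__.division
print(feat.getOptionalRelease())   # release where the future import first worked
print(feat.getMandatoryRelease())  # release where the behaviour becomes the default

# The feature is scheduled to become the default, not to disappear
assert feat.getOptionalRelease() <= feat.getMandatoryRelease()
```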

I can't think of any __future__ feature that could be easily reverted 
once people start relying on it. Either syntax would break, or behaviour 
would change.


The PEP even explicitly states that __future__ should not be used for 
changes which are backward compatible:


Note that there is no need to involve the future_statement machinery
in new features unless they can break existing code; fully backward-
compatible additions can-- and should --be introduced without a
corresponding future_statement.


I wasn't around for the move from 1.4 regex to 1.5 re, so I don't know 
what was done poorly last time. But I can't see why we should treat 
regular expressions so differently from (say) argparse and optparse.


from __future__ import optparse

No. Just... no.




--
Steven



Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Steven D'Aprano

Ben Finney wrote:

Steven D'Aprano  writes:


Ben Finney wrote:

"M.-A. Lemburg"  writes:

No, you tell them: "If you want Unicode 6 semantics, use regex, if
you're fine with Unicode 2.0/3.0 semantics, use re".

What do we say, then, to those who are unaware of the different
semantics between those versions of Unicode, and want regular expressions
to “just work” in Python?

To which document can we direct them to understand what semantics they
want?

Presumably, like all modules, both the re and the regex module will
have their own individual pages in the library reference.


My question is directed more to M-A Lemburg's passage above, and its
implicit assumption that the user understand the changes between
“Unicode 2.0/3.0 semantics” and “Unicode 6 semantics”, and how their own
needs relate to those semantics.

For programmers who know they want to follow Unicode conventions in
Python, but don't know the distinction M-A Lemburg is drawing, to which
document does he recommend we direct them?



I can only repeat my answer: the docs for the new regex module should 
include a discussion of the differences. If that requires summarising 
the differences that M-A Lemburg refers to, then so be it.




“The Unicode specification document in its various versions” isn't a
feasible answer.


Presumably the Unicode spec will be the canonical source, but I agree 
that we should not expect people to read that in order to make a 
decision between re and regex.



--
Steven


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Ben Finney
Steven D'Aprano  writes:

> Ben Finney wrote:
> > "M.-A. Lemburg"  writes:
>
> >> No, you tell them: "If you want Unicode 6 semantics, use regex, if
> >> you're fine with Unicode 2.0/3.0 semantics, use re".
> >
> > What do we say, then, to those who are unaware of the different
> > semantics between those versions of Unicode, and want regular expressions
> > to “just work” in Python?
> >
> > To which document can we direct them to understand what semantics they
> > want?
>
> Presumably, like all modules, both the re and the regex module will
> have their own individual pages in the library reference.

My question is directed more to M-A Lemburg's passage above, and its
implicit assumption that the user understand the changes between
“Unicode 2.0/3.0 semantics” and “Unicode 6 semantics”, and how their own
needs relate to those semantics.

For programmers who know they want to follow Unicode conventions in
Python, but don't know the distinction M-A Lemburg is drawing, to which
document does he recommend we direct them?

“The Unicode specification document in its various versions” isn't a
feasible answer.

-- 
 \ “Computers are useless. They can only give you answers.” —Pablo |
  `\   Picasso |
_o__)  |
Ben Finney



Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Antoine Pitrou
On Fri, 26 Aug 2011 17:25:56 -0700
Dan Stromberg  wrote:
> On Fri, Aug 26, 2011 at 5:08 PM, Antoine Pitrou  wrote:
> 
> > On Fri, 26 Aug 2011 15:48:42 -0700
> > Dan Stromberg  wrote:
> > >
> > > Then there probably should be a from __future__ import for a while.
> >
> > If you are willing to use a "from __future__ import", why not simply
> >
> >import regex as re
> >
> > ? We're not Perl, we don't have built-in syntactic support for regular
> > expressions.
> >
> > Regards
> >
> 
> If you add regex as "import regex", and the new regex module doesn't work
> out, regex might be harder to get rid of.  from __future__ import is an
> established way of trying something for a while to see if it's going to
> work.

That's an interesting idea. This way, integrating the new module would
be a less risky move, since if it gives us too many problems, we could
back out our decision in the next feature release.

Regards

Antoine.


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Antoine Pitrou
On Sat, 27 Aug 2011 04:37:21 +0300
Ezio Melotti  wrote:
> 
> I'm not sure it's worth doing an extensive review of the code; a better
> approach might be to require extensive test coverage (and a review of the
> tests).  If the code seems well written, commented, documented (I think
> proper rst documentation is still missing),

Isn't this precisely what a review is supposed to assess?

> We will get familiar with the code once we start contributing
> to it and fixing bugs, as already happens with most of the other modules.

I'm not sure it's a good idea for a module with more than 10000 lines
of C code (and 4000 lines of pure Python code). This is several times
the size of multiprocessing. The C code looks very cleanly written, but
it's still a big chunk of algorithmically sophisticated code.

Another "interesting" question is whether it's easy to port to the PEP
393 string representation, if it gets accepted.

Regards

Antoine.




Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Steven D'Aprano

Ben Finney wrote:

"M.-A. Lemburg"  writes:



No, you tell them: "If you want Unicode 6 semantics, use regex, if
you're fine with Unicode 2.0/3.0 semantics, use re".


What do we say, then, to those who are unaware of the different
semantics between those versions of Unicode, and want regular expressions
to “just work” in Python?

To which document can we direct them to understand what semantics they
want?


Presumably, like all modules, both the re and the regex module will have 
their own individual pages in the library reference. As the newcomer, 
regex should include a discussion of differences between the two. This 
can then be quietly dropped once re becomes formally deprecated.


(Assuming that the std lib keeps re and regex in parallel for a few 
releases, which is not a given.)


However, I note that last time, the old regex module was just documented 
as obsolete with little detailed discussion of the differences:


http://docs.python.org/release/1.5/lib/node69.html#SECTION00530


--
Steven


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Ezio Melotti
On Sat, Aug 27, 2011 at 1:57 AM, Guido van Rossum  wrote:

> On Fri, Aug 26, 2011 at 3:54 PM, "Martin v. Löwis" 
> wrote:
> > [...]
> > Among us, some are more "regex gurus" than others; you know
> > who you are. I guess the PSF would pay for the review, if that
> > is what it would take.
>
> Makes sense. I noticed Ezio seems quite in favor of regex. Maybe he knows
> more?
>

Matthew has always been responsive on the tracker, usually fixing reported
bugs in a matter of days, and I think he's willing to keep doing so once the
regex module is included.  Even though I haven't tried the module myself yet
(I'm planning to, though), it seems quite popular out there (the download
count on PyPI apparently gets reset for each new release, so I don't know
the exact total), and apparently people are already using it as a
replacement for re.

I'm not sure it's worth doing an extensive review of the code; a better
approach might be to require extensive test coverage (and a review of the
tests).  If the code seems well written, commented, documented (I think
proper rst documentation is still missing), and tested (both with unittest
and out in the wild), and Matthew is willing to maintain it, I think we can
include it.  We will get familiar with the code once we start contributing
to it and fixing bugs, as already happens with most of the other modules.

See also the "New regex module for 3.2?" thread (
http://mail.python.org/pipermail/python-dev/2010-July/101606.html ).

Best Regards,
Ezio Melotti


>
> --
> --Guido van Rossum (python.org/~guido )
>


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Ben Finney
"M.-A. Lemburg"  writes:

> Guido van Rossum wrote:

> > I really don't want to have to tell people "Oh, that bug is fixed
> > but you have to use regex instead of re" and then a few years later
> > have to tell them "Oh, we're deprecating regex, you should just use
> > re".
>
> No, you tell them: "If you want Unicode 6 semantics, use regex, if
> you're fine with Unicode 2.0/3.0 semantics, use re".

What do we say, then, to those who are unaware of the different
semantics between those versions of Unicode, and want regular expressions
to “just work” in Python?

To which document can we direct them to understand what semantics they
want?

> After all, it's not like re suddenly stopped working :-)

For some value of “working”, that is. The trick is to know whether that
value is what one wants.

-- 
 \“The fact of your own existence is the most astonishing fact |
  `\you'll ever have to confront. Don't dare ever see your life as |
_o__)boring, monotonous, or joyless.” —Richard Dawkins, 2010-03-10 |
Ben Finney



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy  wrote:
>
>
> On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
>>>
>>> IronPython and Jython can retain UTF-16 as their native form if that
>>> makes interop cleaner, but in doing so they need to ensure that basic
>>> operations like indexing and len work in terms of code points, not
>>> code units, if they are to conform.
>
> My impression is that a UTF-16 implementation, to be properly called such,
> must do len and [] in terms of code points, which is why Python's narrow
> builds are called UCS-2 and not UTF-16.

I don't think anyone else has that impression. Please cite chapter and
verse if you really think this is important. IIUC, UCS-2 does not
allow surrogate pairs, whereas Python (and Java, and .NET, and
Windows) 16-bit strings all do support surrogate pairs. And they all
have a len or length function that counts code units, not code points.

>> That means that they won't conform, period. There is no efficient
>> maintainable implementation strategy to achieve that property,
>
> Given that both 'efficient' and 'maintainable' are relative terms, that is
> your pessimistic opinion, not really a fact.
>
>> it may take well years until somebody provides an efficient
>> unmaintainable implementation.
>>
>>> Does this make sense, or have I completely misunderstood things?
>>
>> You seem to assume it is ok for Jython/IronPython to provide indexing in
>> O(n). It is not.
>
> Why do you keep saying that O(n) is the alternative? I have already given a
> simple solution that is O(log k), where k is the number of non-BMP
> characters/codepoints/surrogate_pairs if there are any, and O(1) otherwise
> (for all BMP chars). It uses O(k) space. I think that is pretty efficient. I
> suspect that is the most time efficient possible without using at least as
> much space as a UCS-4 solution. The fact that you and others do not want this
> for CPython should not preclude other implementations that are more tied to
> UTF-16 from exploring the idea.
>
> Maintainability partly depends on whether all-codepoint support is built in
> or bolted on to a BMP-only implementation burdened with back compatibility
> for a code unit API. Maintainability is probably harder with a separate
> UTF-32 type, which CPython has but which I gather Jython and IronPython do
> not. It might or might not be easier if there were a separate internal
> character type containing a 32-bit code point value, so that iteration and
> indexing (and single char slicing) always returned the same type of object
> regardless of whether the character was in the BMP or not. This certainly
> would help all the unicode database functions.
>
> Tom Christiansen appears to have said that Perl is or will use UTF-8 plus
> auxiliary arrays. If so, we will find out if they can maintain it.

Their API style is completely different from ours. What Perl can
maintain has little bearing on what Python can.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Greg Ewing

M.-A. Lemburg wrote:

Simply going with UCS-4 does not solve the problem, since
even with UCS-4 storage, you can still have surrogates in your
Python Unicode string.


Yes, but in that case, you presumably *intend* them to
be treated as separate indexing units. If you didn't,
there would be no need to use surrogates in the first
place.

--
Greg


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Antoine Pitrou
On Sat, 27 Aug 2011 12:17:18 +1200
Greg Ewing  wrote:
> Paul Moore wrote:
> 
> > IronPython and Jython can retain UTF-16 as their native form if that
> > makes interop cleaner, but in doing so they need to ensure that basic
> > operations like indexing and len work in terms of code points, not
> > code units, if they are to conform. ... They lose the O(1)
> > guarantee, but that's easily defensible as a tradeoff to conform to
> > underlying runtime semantics.
> 
> I would only agree as long as it wasn't too much worse
> than O(1). O(log n) might be all right, but O(n) would be
> unacceptable, I think.

It also depends a lot on *actual* measured performance. As someone
mentioned in the tracker, the index you use on a string usually comes
from a previous string operation (like a search), perhaps with a small
offset. So a caching scheme may actually give very good results with a
rather small overhead (you could cache, say, the 4 most recent indices
and choose the nearest when an indexing operation is done; with utf-8,
scanning backward and forward is equally simple).
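A toy sketch of such a cache over a UTF-8 buffer (the class name and the 4-entry policy are illustrative, not a proposal for CPython): keep a few recent (code-point index, byte offset) pairs and scan from the nearest one, in either direction.

```python
class Utf8Index:
    """Map a code-point index to a byte offset in a UTF-8 buffer,
    scanning from the nearest of a few cached positions."""

    def __init__(self, buf: bytes, cache_size: int = 4):
        self.buf = buf
        self.cache = [(0, 0)]            # (code-point index, byte offset)
        self.cache_size = cache_size

    @staticmethod
    def _is_lead(byte: int) -> bool:
        # UTF-8 continuation bytes look like 0b10xxxxxx
        return (byte & 0xC0) != 0x80

    def offset(self, index: int) -> int:
        # start from the cached position nearest to the target
        cp, off = min(self.cache, key=lambda e: abs(e[0] - index))
        while cp < index:                # scan forward
            off += 1
            while off < len(self.buf) and not self._is_lead(self.buf[off]):
                off += 1
            cp += 1
        while cp > index:                # scanning backward is just as simple
            off -= 1
            while off > 0 and not self._is_lead(self.buf[off]):
                off -= 1
            cp -= 1
        self.cache.append((cp, off))
        del self.cache[:-self.cache_size]  # keep only the most recent entries
        return off
```

Successive nearby lookups (the common case described above) then cost O(distance from the last hit) rather than O(n) from the start of the string.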

Regards

Antoine.




Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Dan Stromberg
On Fri, Aug 26, 2011 at 5:08 PM, Antoine Pitrou  wrote:

> On Fri, 26 Aug 2011 15:48:42 -0700
> Dan Stromberg  wrote:
> >
> > Then there probably should be a from __future__ import for a while.
>
> If you are willing to use a "from __future__ import", why not simply
>
>import regex as re
>
> ? We're not Perl, we don't have built-in syntactic support for regular
> expressions.
>
> Regards
>

If you add regex as "import regex", and the new regex module doesn't work
out, regex might be harder to get rid of.  from __future__ import is an
established way of trying something for a while to see if it's going to
work.

EG: "from __future__ import re", where re is really the new module.

But whatever.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Greg Ewing

Paul Moore wrote:


IronPython and Jython can retain UTF-16 as their native form if that
makes interop cleaner, but in doing so they need to ensure that basic
operations like indexing and len work in terms of code points, not
code units, if they are to conform. ... They lose the O(1)
guarantee, but that's easily defensible as a tradeoff to conform to
underlying runtime semantics.


I would only agree as long as it wasn't too much worse
than O(1). O(log n) might be all right, but O(n) would be
unacceptable, I think.

--
Greg


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Antoine Pitrou
On Fri, 26 Aug 2011 15:48:42 -0700
Dan Stromberg  wrote:
> 
> Then there probably should be a from __future__ import for a while.

If you are willing to use a "from __future__ import", why not simply

import regex as re

? We're not Perl, we don't have built-in syntactic support for regular
expressions.

Regards

Antoine.




Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Antoine Pitrou
On Sat, 27 Aug 2011 01:00:31 +0200
"M.-A. Lemburg"  wrote:
> > 
> > I can't say I liked how that transition was handled last time around.
> > I really don't want to have to tell people "Oh, that bug is fixed but
> > you have to use regex instead of re" and then a few years later have
> > to tell them "Oh, we're deprecating regex, you should just use re".
> 
> No, you tell them: "If you want Unicode 6 semantics, use regex,
> if you're fine with Unicode 2.0/3.0 semantics, use re". After all,
> it's not like re suddenly stopped working :-)

It has a whole lot of new features in addition to better unicode
support. See for yourself:
https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails
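For flavour, two of the features that page lists which re (as of 2011) lacked — Unicode script properties and approximate ("fuzzy") matching. This assumes the third-party regex package is installed:

```python
import regex  # third-party: pip install regex

# Unicode script/property classes, e.g. \p{Greek}
assert regex.match(r"\p{Greek}+", "αβγ")

# Fuzzy matching: allow up to one error (insert/delete/substitute)
assert regex.search(r"(?:foobar){e<=1}", "fooxar")
```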

> Perhaps we could have a summer of code student do a review and
> analysis to get familiar with the code and then have at least
> two developers know the code well enough to support it for
> a while.

I'm not sure a GSoC student would be the best candidate to do a review
matching our expectations.

Regards

Antoine.




Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Antoine Pitrou
On Fri, 26 Aug 2011 15:47:21 -0700
Guido van Rossum  wrote:
> > The best way would be to contact the author, Matthew Barnett,
> 
> I had added him to the beginning of this thread but someone took him off.
> 
> > or to ask
> > on the tracker on http://bugs.python.org/issue2636. He has been quite
> > willing to answer such questions in the past, AFAIR.
> 
> So, that issue is about something called "regexp". AFAIK Matthew
> (MRAB) wrote something called "regex"
> (http://pypi.python.org/pypi/regex). Are they two different things???

No, it's the same.  The source is at
https://code.google.com/p/mrab-regex-hg/, btw.

Regards

Antoine.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Terry Reedy



On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:

IronPython and Jython can retain UTF-16 as their native form if that
makes interop cleaner, but in doing so they need to ensure that basic
operations like indexing and len work in terms of code points, not
code units, if they are to conform.


My impression is that a UTF-16 implementation, to be properly called 
such, must do len and [] in terms of code points, which is why Python's 
narrow builds are called UCS-2 and not UTF-16.
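[Editor's note: a small stdlib-only illustration of the code point vs. code unit distinction Terry is drawing; the encode call stands in for a UTF-16 internal representation.]

```python
# One non-BMP character: a single code point, but two UTF-16 code units
# (a surrogate pair).  A "proper" UTF-16 implementation must report len 1.
s = '\U0001F600'
code_points = len(s)                          # 1 on a wide (or 3.3+) build
code_units = len(s.encode('utf-16-le')) // 2  # 2: high + low surrogate
print(code_points, code_units)
```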



That means that they won't conform, period. There is no efficient
maintainable implementation strategy to achieve that property,


Given that both 'efficient' and 'maintainable' are relative terms, that 
is your pessimistic opinion, not really a fact.



it may well take years until somebody provides an efficient
unmaintainable implementation.


Does this make sense, or have I completely misunderstood things?


You seem to assume it is ok for Jython/IronPython to provide indexing in
O(n). It is not.


Why do you keep saying that O(n) is the alternative? I have already 
given a simple solution that is O(logk), where k is the number of 
non-BMP characters/codepoints/surrogate_pairs if there are any, and O(1) 
otherwise (for all BMP chars). It uses O(k) space. I think that is 
pretty efficient. I suspect that is the most time efficient possible 
without using at least as much space as a UCS-4 solution. The fact that 
you and others do not want this for CPython should not preclude other 
implementations that are more tied to UTF-16 from exploring the idea.
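[Editor's note: a rough Python sketch of the auxiliary-array scheme Terry describes; the names are made up, and a real implementation would keep the index alongside the UTF-16 buffer rather than an encoded bytes object.]

```python
from bisect import bisect_left

def build_astral_index(s):
    """Sorted code-point indices of the non-BMP characters in s
    (the k in O(log k)); empty, and thus O(1) lookups, for pure-BMP text."""
    return [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]

def codeunit_offset(astral, i):
    """Map code-point index i to its UTF-16 code-unit offset.
    Each non-BMP character before i adds one extra code unit."""
    return i + bisect_left(astral, i)

s = 'a\U0001F600b'                 # 'b' is code point 2, but code unit 3
units = s.encode('utf-16-le')      # stand-in for a UTF-16 code-unit buffer
astral = build_astral_index(s)
j = codeunit_offset(astral, 2)
assert units[2*j:2*j+2].decode('utf-16-le') == 'b'
```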


Maintainability partly depends on whether all-codepoint support is built 
in or bolted on to a BMP-only implementation burdened with back 
compatibility for a code unit API. Maintainability is probably harder 
with a separate UTF-32 type, which CPython has but which I gather Jython 
and IronPython do not. It might or might not be easier if there were a 
separate internal character type containing a 32-bit code point value, 
so that iteration and indexing (and single-char slicing) always 
returned the same type of object regardless of whether the character was 
in the BMP or not. This certainly would help all the unicode database 
functions.


Tom Christiansen appears to have said that Perl is or will use UTF-8 
plus auxiliary arrays. If so, we will find out if they can maintain it.


---
Terry Jan Reedy



Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Tom Christiansen
"M.-A. Lemburg"  wrote
   on Sat, 27 Aug 2011 01:00:31 +0200: 

> The good part is that it's based on the re code, the FUD comes
> from the fact that the new lib is 380kB larger than the old one
> and that's not even counting the generated 500kB of lookup
> tables.

Well, you have to put the property tables somewhere, somehow.
There are various schemes for demand loading them as needed,
but I don't know whether those are used.
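[Editor's note: one demand-loading scheme, sketched in pure Python — the real property tables are generated C arrays, and `category_members` is a made-up name. The idea is simply to defer building a table until a pattern first needs it.]

```python
import unicodedata
from functools import lru_cache

@lru_cache(maxsize=None)
def category_members(cat):
    # Build the BMP membership set for a general category only on first use;
    # subsequent lookups hit the cache instead of rebuilding the table.
    return frozenset(cp for cp in range(0x10000)
                     if unicodedata.category(chr(cp)) == cat)

print(ord('A') in category_members('Lu'))   # True
print(ord('A') in category_members('Ll'))   # False
```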

--tom


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 4:21 PM, MRAB  wrote:
> On 27/08/2011 00:08, Tom Christiansen wrote:
>>
>> "M.-A. Lemburg"  wrote
>>    on Sat, 27 Aug 2011 01:00:31 +0200:
>>
>>> The good part is that it's based on the re code, the FUD comes
>>> from the fact that the new lib is 380kB larger than the old one
>>> and that's not even counting the generated 500kB of lookup
>>> tables.
>>
>> Well, you have to put the property tables somewhere, somehow.
>> There are various schemes for demand loading them as needed,
>> but I don't know whether those are used.
>>
> FYI, the .pyd for Python v3.2 is 227KB, about half of which is property
> tables.

I wouldn't hold the size of the generated tables against you. :-)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread MRAB

On 27/08/2011 00:08, Tom Christiansen wrote:

"M.-A. Lemburg"  wrote
on Sat, 27 Aug 2011 01:00:31 +0200:


The good part is that it's based on the re code, the FUD comes
from the fact that the new lib is 380kB larger than the old one
and that's not even counting the generated 500kB of lookup
tables.


Well, you have to put the property tables somewhere, somehow.
There are various schemes for demand loading them as needed,
but I don't know whether those are used.


FYI, the .pyd for Python v3.2 is 227KB, about half of which is property
tables.


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread M.-A. Lemburg
Guido van Rossum wrote:
> On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg  wrote:
>> Guido van Rossum wrote:
>>> I just made a pass of all the Unicode-related bugs filed by Tom
>>> Christiansen, and found that in several, the response was "this is
>>> fixed in the regex module [by Matthew Barnett]". I started replying
>>> that I thought that we should fix the bugs in the re module (i.e.,
>>> really in _sre.c) but on second thought I wonder if maybe regex is
>>> mature enough to replace re in Python 3.3. It would mean that we won't
>>> fix any of these bugs in earlier Python versions, but I could live
>>> with that.
>>>
>>> However, I don't know much about regex -- how compatible is it, how
>>> fast is it (including extreme cases where the backtracking goes
>>> crazy), how bug-free is it, and so on. Plus, how much work would it be
>>> to actually incorporate it into CPython as a complete drop-in
>>> replacement of the re package (such that nobody needs to change their
>>> imports or the flags they pass to the re module).
>>>
>>> We'd also probably have to train some core developers to be familiar
>>> enough with the code to maintain and evolve it -- I assume we can't
>>> just volunteer Matthew to do so forever... :-)
>>>
>>> What's the alternative? Is adding the requested bug fixes and new
>>> features to _sre.c really that hard?
>>
>> Why not simply add the new lib, see whether it works out and
>> then decide which path to follow.
>>
>> We've done that with the old regex lib. It took a few years
>> and releases to have people port their applications to the
>> then new re module and syntax, but in the end it worked.
>>
>> With a new regex library there are likely going to be quite
>> a few subtle differences between re and regex - even if it's
>> just doing things in a more Unicode compatible way.
>>
>> I don't think anyone can actually list all the differences given
>> the complex nature of regular expressions, so people will
> likely need a few years and releases to get used to it before
>> a switch can be made.
> 
> I can't say I liked how that transition was handled last time around.
> I really don't want to have to tell people "Oh, that bug is fixed but
> you have to use regex instead of re" and then a few years later have
> to tell them "Oh, we're deprecating regex, you should just use re".

No, you tell them: "If you want Unicode 6 semantics, use regex,
if you're fine with Unicode 2.0/3.0 semantics, use re". After all,
it's not like re suddenly stopped working :-)

> I'm really hoping someone has more actual technical understanding of
> re vs. regex and can give us some facts about the differences, rather
> than, frankly, FUD.

The good part is that it's based on the re code, the FUD comes
from the fact that the new lib is 380kB larger than the old one
and that's not even counting the generated 500kB of lookup
tables.

If no one steps up to do a review or analysis, I think the
only practical way to test the lib is to give it a prominent
chance to prove itself.

The other aspect is maintenance.

Perhaps we could have a summer of code student do a review and
analysis to get familiar with the code and then have at least
two developers know the code well enough to support it for
a while.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 27 2011)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-10-04: PyCon DE 2011, Leipzig, Germany ...38 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:54 PM, "Martin v. Löwis"  wrote:
>> However, I don't know much about regex
>
> The problem really is: nobody does (except for Matthew Barnett
> probably). This means that this contribution might be stuck
> "forever": somebody would have to review the module, identify
> issues, approve it, and take the blame if something breaks.
> That takes considerable time and has a considerable risk, for
> little expected glory - so nobody has volunteered to
> mentor/manage integration of that code.
>
> I believe most core contributors (who have run into this code)
> consider it worthwhile, but are just too scared to take action.
>
> Among us, some are more "regex gurus" than others; you know
> who you are. I guess the PSF would pay for the review, if that
> is what it would take.

Makes sense. I noticed Ezio seems quite in favor of regex. Maybe he knows more?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Martin v. Löwis
> However, I don't know much about regex

The problem really is: nobody does (except for Matthew Barnett
probably). This means that this contribution might be stuck
"forever": somebody would have to review the module, identify
issues, approve it, and take the blame if something breaks.
That takes considerable time and has a considerable risk, for
little expected glory - so nobody has volunteered to
mentor/manage integration of that code.

I believe most core contributors (who have run into this code)
consider it worthwhile, but are just too scared to take action.

Among us, some are more "regex gurus" than others; you know
who you are. I guess the PSF would pay for the review, if that
is what it would take.

Regards,
Martin


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Dan Stromberg
On Fri, Aug 26, 2011 at 2:45 PM, Guido van Rossum  wrote:

> ...but on second thought I wonder if maybe regex is
> mature enough to replace re in Python 3.3.
>

I agree that the move from regex to re was kind of painful.

It seems someone should merge the unit tests for re and regex, and apply the
merged result to each for the sake of comparison.  There might also be a
need to expand the merged result to include new things.
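[Editor's note: a merged suite could be a set of behavioural checks parameterised over the module under test — a minimal sketch using only the stdlib `re`; the PyPI `regex` module would be passed the same way, if installed.]

```python
import re

def check_engine(mod):
    # Behavioural checks any re-compatible engine should pass.
    assert mod.match(r'(?i)abc', 'ABC')
    assert mod.findall(r'\d+', 'a1b22c') == ['1', '22']
    assert mod.sub(r'\s+', ' ', 'a  b\tc') == 'a b c'

check_engine(re)
# import regex; check_engine(regex)  # same checks against the candidate
```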

Then there probably should be a from __future__ import for a while.


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:33 PM, Antoine Pitrou  wrote:
> On Fri, 26 Aug 2011 15:18:35 -0700
> Guido van Rossum  wrote:
>>
>> I can't say I liked how that transition was handled last time around.
>> I really don't want to have to tell people "Oh, that bug is fixed but
>> you have to use regex instead of re" and then a few years later have
>> to tell them "Oh, we're deprecating regex, you should just use re".
>>
>> I'm really hoping someone has more actual technical understanding of
>> re vs. regex and can give us some facts about the differences, rather
>> than, frankly, FUD.
>
> The best way would be to contact the author, Matthew Barnett,

I had added him to the beginning of this thread but someone took him off.

> or to ask
> on the tracker on http://bugs.python.org/issue2636. He has been quite
> willing to answer such questions in the past, AFAIR.

So, that issue is about something called "regexp". AFAIK Matthew
(MRAB) wrote something called "regex"
(http://pypi.python.org/pypi/regex). Are they two different things???

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Antoine Pitrou
On Fri, 26 Aug 2011 15:18:35 -0700
Guido van Rossum  wrote:
> 
> I can't say I liked how that transition was handled last time around.
> I really don't want to have to tell people "Oh, that bug is fixed but
> you have to use regex instead of re" and then a few years later have
> to tell them "Oh, we're deprecating regex, you should just use re".
> 
> I'm really hoping someone has more actual technical understanding of
> re vs. regex and can give us some facts about the differences, rather
> than, frankly, FUD.

The best way would be to contact the author, Matthew Barnett, or to ask
on the tracker on http://bugs.python.org/issue2636. He has been quite
willing to answer such questions in the past, AFAIR.

Regards

Antoine.




Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg  wrote:
> Guido van Rossum wrote:
>> I just made a pass of all the Unicode-related bugs filed by Tom
>> Christiansen, and found that in several, the response was "this is
>> fixed in the regex module [by Matthew Barnett]". I started replying
>> that I thought that we should fix the bugs in the re module (i.e.,
>> really in _sre.c) but on second thought I wonder if maybe regex is
>> mature enough to replace re in Python 3.3. It would mean that we won't
>> fix any of these bugs in earlier Python versions, but I could live
>> with that.
>>
>> However, I don't know much about regex -- how compatible is it, how
>> fast is it (including extreme cases where the backtracking goes
>> crazy), how bug-free is it, and so on. Plus, how much work would it be
>> to actually incorporate it into CPython as a complete drop-in
>> replacement of the re package (such that nobody needs to change their
>> imports or the flags they pass to the re module).
>>
>> We'd also probably have to train some core developers to be familiar
>> enough with the code to maintain and evolve it -- I assume we can't
>> just volunteer Matthew to do so forever... :-)
>>
>> What's the alternative? Is adding the requested bug fixes and new
>> features to _sre.c really that hard?
>
> Why not simply add the new lib, see whether it works out and
> then decide which path to follow.
>
> We've done that with the old regex lib. It took a few years
> and releases to have people port their applications to the
> then new re module and syntax, but in the end it worked.
>
> With a new regex library there are likely going to be quite
> a few subtle differences between re and regex - even if it's
> just doing things in a more Unicode compatible way.
>
> I don't think anyone can actually list all the differences given
> the complex nature of regular expressions, so people will
> likely need a few years and releases to get used to it before
> a switch can be made.

I can't say I liked how that transition was handled last time around.
I really don't want to have to tell people "Oh, that bug is fixed but
you have to use regex instead of re" and then a few years later have
to tell them "Oh, we're deprecating regex, you should just use re".

I'm really hoping someone has more actual technical understanding of
re vs. regex and can give us some facts about the differences, rather
than, frankly, FUD.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread M.-A. Lemburg
Guido van Rossum wrote:
> I just made a pass of all the Unicode-related bugs filed by Tom
> Christiansen, and found that in several, the response was "this is
> fixed in the regex module [by Matthew Barnett]". I started replying
> that I thought that we should fix the bugs in the re module (i.e.,
> really in _sre.c) but on second thought I wonder if maybe regex is
> mature enough to replace re in Python 3.3. It would mean that we won't
> fix any of these bugs in earlier Python versions, but I could live
> with that.
> 
> However, I don't know much about regex -- how compatible is it, how
> fast is it (including extreme cases where the backtracking goes
> crazy), how bug-free is it, and so on. Plus, how much work would it be
> to actually incorporate it into CPython as a complete drop-in
> replacement of the re package (such that nobody needs to change their
> imports or the flags they pass to the re module).
> 
> We'd also probably have to train some core developers to be familiar
> enough with the code to maintain and evolve it -- I assume we can't
> just volunteer Matthew to do so forever... :-)
> 
> What's the alternative? Is adding the requested bug fixes and new
> features to _sre.c really that hard?

Why not simply add the new lib, see whether it works out and
then decide which path to follow.

We've done that with the old regex lib. It took a few years
and releases to have people port their applications to the
then new re module and syntax, but in the end it worked.

With a new regex library there are likely going to be quite
a few subtle differences between re and regex - even if it's
just doing things in a more Unicode compatible way.

I don't think anyone can actually list all the differences given
the complex nature of regular expressions, so people will
likely need a few years and releases to get used to it before
a switch can be made.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
I have a different question about IronPython and Jython now. Do their
regular expression libraries support Unicode better than CPython's?
E.g. does "." match a surrogate pair? Tom C suggests that Java's regex
libraries get this and many other details right despite Java's use of
UTF-16 to represent strings. So hopefully Jython's re library is built
on top of Java's?
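[Editor's note: for reference, this is what "." gives when strings are indexed by code point, as on wide builds; the narrow-build failure alluded to is that "." would consume only the high surrogate.]

```python
import re

s = '\U0001F600'        # non-BMP: a surrogate pair in UTF-16
m = re.match(r'.', s)
print(len(m.group()))   # 1 when strings are code-point indexed
# On a narrow (UTF-16 code unit) build, '.' would match only the lone
# high surrogate '\ud83d' instead of the whole character.
```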

PS. Is there a better contact for Jython?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Victor Stinner
On Friday, 26 August 2011 at 02:01:42, Dino Viehland wrote:
> The biggest difficulty for IronPython here would be dealing w/ .NET
> interop. We can certainly introduce either an IronPython specific string
> class which is similar to CPython's PyUnicodeObject or we could have
> multiple distinct .NET types (IronPython.Runtime.AsciiString,
> System.String, and
> IronPython.Runtime.Ucs4String) which all appear as the same type to Python.
> 
> But when Python is calling a .NET API it's always going to return a
> System.String which is UTF-16.  If we had to check and convert all of
> those strings when they cross into Python it would be very bad for
> performance.  Presumably we could have a 4th type of "interop" string
> which lazily computes this but if we start wrapping .Net strings we could
> also get into object identity issues.

Python 3 encodes all Unicode strings to the OS encoding (and the result is 
decoded) for all syscalls and calls to libraries: to the locale encoding on 
UNIX, to UTF-16 on Windows. Currently, Py_UNICODE is wchar_t, which is 16 bits 
on Windows, so Py_UNICODE* is already a UTF-16 string there.

I don't know if the overhead of the PEP 393 (encode to UTF-16 on Windows) for 
these calls is significant or not. But on UNIX, pure ASCII strings don't have to 
be encoded anymore if the locale encoding is UTF-8 or ASCII.

IronPython can wait to see how CPython+PEP 393 handles these problems, and how 
much slower it is.

> But it's a huge change - it'll almost certainly touch every single source
> file in IronPython.

With the PEP 393, it's transparent: PyUnicode_AS_UNICODE encodes the 
string to UTF-16 (allocating memory, etc.), except that applications should now 
check whether an error occurred (check for NULL).

> I would think we'd get 3.2 done first and then think
> about what to do here.

I don't think that IronPython needs to support non-BMP characters without 
using surrogates. Bug reports about non-BMP characters usually don't have use 
cases, but just want to make Python perfect. There is no need to hurry.

PEP 393 tries to reduce the memory footprint. The effect on non-BMP characters 
is just a *nice* side effect. Or was the PEP designed to solve narrow-build 
issues?

Victor



[Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread Guido van Rossum
I just made a pass of all the Unicode-related bugs filed by Tom
Christiansen, and found that in several, the response was "this is
fixed in the regex module [by Matthew Barnett]". I started replying
that I thought that we should fix the bugs in the re module (i.e.,
really in _sre.c) but on second thought I wonder if maybe regex is
mature enough to replace re in Python 3.3. It would mean that we won't
fix any of these bugs in earlier Python versions, but I could live
with that.

However, I don't know much about regex -- how compatible is it, how
fast is it (including extreme cases where the backtracking goes
crazy), how bug-free is it, and so on. Plus, how much work would it be
to actually incorporate it into CPython as a complete drop-in
replacement of the re package (such that nobody needs to change their
imports or the flags they pass to the re module).

We'd also probably have to train some core developers to be familiar
enough with the code to maintain and evolve it -- I assume we can't
just volunteer Matthew to do so forever... :-)

What's the alternative? Is adding the requested bug fixes and new
features to _sre.c really that hard?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Stefan Behnel

Stefan Behnel, 26.08.2011 20:28:

"Martin v. Löwis", 26.08.2011 18:56:

I agree with your observation that something should be done about error
handling, and will update the PEP shortly. I propose that
PyUnicode_Ready should be explicitly called on input where raising an
exception is feasible. In contexts where it is not feasible (such
as reading a character, or reading the length or the kind), failing to
ready the string should cause a fatal error.

[...]
My gut feeling leans towards a KISS approach. If you go the route to
require an explicit point for triggering PyUnicode_Ready() calls, why not
just go all the way and make it completely explicit in *all* cases? I.e.
remove all implicit calls from the macros and make it part of the new API
semantics that users *must* call PyUnicode_FAST_READY() before doing
anything with a new string data layout. Much fewer surprises.

Note that there isn't currently an official macro way to figure out that
the flexible string layout has not been initialised yet, i.e. that wstr is
set but str is not. If the implicit PyUnicode_Ready() calls get removed,
PyUnicode_KIND() could take that place by simply returning WSTR_KIND.


Here's a patch that updates only the header file, to make it clear what I mean.

Stefan
# HG changeset patch
# User Stefan Behnel 
# Date 1314388513 -7200
# Branch pep-393
# Node ID 247e45f0c26f6f0f6a552f2eddb3598ae643adf1
# Parent  675e2004b38e809f12171750388b00620e1967c4
simplify new PyUnicode_*() macros by removing implicit calls to PyUnicode_Ready(); minor cleanups

diff -r 675e2004b38e -r 247e45f0c26f Include/unicodeobject.h
--- a/Include/unicodeobject.h	Fri Aug 26 14:21:14 2011 -0400
+++ b/Include/unicodeobject.h	Fri Aug 26 21:55:13 2011 +0200
@@ -282,18 +282,15 @@
 #define SSTATE_IS_COMPACT 0x10
 
 
-/* String contains only wstr byte characters.  This is only possible
-   when the string was created with a legacy API and PyUnicode_Ready()
-   has not been called yet.  Note that PyUnicode_KIND() calls
-   PyUnicode_FAST_READY() so PyUnicode_WCHAR_KIND is only possible as a
-   intialized value not as a result of PyUnicode_KIND(). */
-#define PyUnicode_WCHAR_KIND 0
-
 /* Return values of the PyUnicode_KIND() macro: */
-
 #define PyUnicode_1BYTE_KIND 1
 #define PyUnicode_2BYTE_KIND 2
 #define PyUnicode_4BYTE_KIND 3
+#define PyUnicode_WCHAR_KIND 0 /* String contains only wstr byte
+  characters.  This is the case when
+  the string was created with a legacy
+  API and PyUnicode_Ready() has not
+  been called yet. */
 
 
 /* Return the number of bytes the string uses to represent single characters,
@@ -301,11 +298,10 @@
 #define PyUnicode_CHARACTER_SIZE(op) \
 (1 << (((SSTATE_KIND_MASK & ((PyUnicodeObject *)(op))->state) >> 2) - 1))
 
-/* Return pointers to the canonical representation casted as unsigned char,
-   Py_UCS2, or Py_UCS4 for direct character access.
-   No checks are performed, use PyUnicode_CHARACTER_SIZE or
-   PyUnicode_KIND() before to ensure these will work correctly. */
-
+/* Return pointers to the canonical representation cast as Py_UCS1,
+   Py_UCS2, or Py_UCS4 for direct character access.  No checks are
+   performed, use PyUnicode_FAST_READY() before to ensure these will
+   work correctly. */
 #define PyUnicode_1BYTE_DATA(op) (((PyUnicodeObject*)op)->data.latin1)
 #define PyUnicode_2BYTE_DATA(op) (((PyUnicodeObject*)op)->data.ucs2)
 #define PyUnicode_4BYTE_DATA(op) (((PyUnicodeObject*)op)->data.ucs4)
@@ -315,18 +311,16 @@
 #define PyUnicode_IS_COMPACT(op) \
 (((op)->state & SSTATE_COMPACT_MASK) == SSTATE_IS_COMPACT)
 
-/* Return one of the PyUnicode_*_KIND values defined above.
-   This macro calls PyUnicode_FAST_READY() before returning the kind. */ 
+/* Return one of the PyUnicode_*_KIND values defined above. */ 
 #define PyUnicode_KIND(op) \
 (assert(PyUnicode_Check(op)), \
- PyUnicode_FAST_READY((PyUnicodeObject *)(op)), \
  ((SSTATE_KIND_MASK & (((PyUnicodeObject *)(op))->state)) >> 2))
 
-/* Return a void pointer to the raw unicode buffer.
-   This macro calls PyUnicode_FAST_READY() before returning the pointer. */ 
+/* Return a void pointer to the raw unicode buffer.  The result is
+   potentially NULL if it has not been initialised, in which case
+   PyUnicode_AS_UNICODE() returns the pointer to the wstr buffer. */ 
 #define PyUnicode_DATA(op) \
 (assert(PyUnicode_Check(op)), \
- PyUnicode_FAST_READY((PyUnicodeObject *)(op)), \
  PyUnicodeObject *)(op))->data.any)))
 
 /* Write into the canonical representation, this macro does not do any sanity
@@ -366,8 +360,9 @@
 
 /* PyUnicode_READ_CHAR() is less efficient than PyUnicode_READ() because it
calls PyUnicode_KIND() and might call it twice.  For single reads, use
-   PyUnicode_READ_CHAR, for multiple consecutive reads callers should
-   cache kind and use PyUnicode_READ instead. */
+   PyUnicode_READ_CHAR(), for multiple consecutive reads callers should
+   cache kind and use PyUnicode_READ() instead.
+   Requires t

Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Stefan Behnel

"Martin v. Löwis", 26.08.2011 18:56:

I agree with your observation that something should be done about error
handling, and will update the PEP shortly. I propose that
PyUnicode_Ready should be explicitly called on input where raising an
exception is feasible. In contexts where it is not feasible (such
as reading a character, or reading the length or the kind), failing to
ready the string should cause a fatal error.


I consider this an increase in complexity. It will then no longer be enough 
to access the data; the user will first have to figure out a suitable place 
in the code to make sure it's actually there, potentially forgetting about 
it because it works in all test cases, or potentially triggering a huge 
amount of overhead that copies and 'recodes' the string data by executing 
one of the macros that does it automatically.


For the specific case of Cython, I would guess that I could just add 
another special case that reads the data from the Py_UNICODE buffer and 
combines surrogates at need, but that will only work in some cases 
(specifically not for indexing). And outside of Cython, most normal user 
code won't do that.


My gut feeling leans towards a KISS approach. If you go the route to 
require an explicit point for triggering PyUnicode_Ready() calls, why not 
just go all the way and make it completely explicit in *all* cases? I.e. 
remove all implicit calls from the macros and make it part of the new API 
semantics that users *must* call PyUnicode_FAST_READY() before doing 
anything with a new string data layout. Much fewer surprises.


Note that there isn't currently an official macro way to figure out that 
the flexible string layout has not been initialised yet, i.e. that wstr is 
set but str is not. If the implicit PyUnicode_Ready() calls get removed, 
PyUnicode_KIND() could take that place by simply returning WSTR_KIND.


That being said, the main problem I currently see is that basically all 
existing code needs to be updated in order to handle these errors. 
Otherwise, it would be possible to trigger crashes by properly forging a 
string and passing it into an unprepared C library to let it run into a 
NULL pointer return value of PyUnicode_AS_UNICODE().


Stefan

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Stefan Behnel

Guido van Rossum, 26.08.2011 19:02:

On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote:

Besides, what if these implementations provided indexing in, say, O(log N)
instead of O(1) or O(N), e.g. by building a tree index into each string? You
could have an index that simply marks runs of surrogate pairs and BMP
substrings, thus providing a likely-to-be-somewhat-compact index. That index
would obviously have to be built, but so do the different string
representations in post-PEP-393 CPython, especially on Windows, as I have
learned.


Eek. No, please.


I was mostly just confabulating. My main point was that this isn't a 
black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. 
You can achieve compliant/acceptable behaviour at the code point level, the 
performance guarantees level or the platform integration level - choose any 
two. CPython is just lucky that there isn't really a platform integration 
level to take into account (if we leave the Windows environment aside for a 
moment).




Those platforms' native string types have length and
slicing operations that are O(1) and work in terms of 16-bit code
points. Python should use those. It would be awful if Java and Python
code doing the same manipulations on the same string would come to
different conclusions because Python tried to paper over surrogates.


I fully agree.



Would such a less severe violation of the strict O(1) rule still be "not
ok"? I think this is not such a clear black-and-white issue. Both
implementations have notably different performance characteristics than
CPython in some more or less important areas, as does PyPy. At some point,
the language compliance label has to account for that.


Since you had to ask, I have to declare that, indeed, non-O(1)
behavior would not be okay for those platforms.


I take it that you say that because you want strings to perform in the 
'normal' platform specific way here (i.e. like Java/.NET strings), and not 
so much because you want to require the exact same (performance) 
characteristics across Python implementations. So your choice is platform 
integration over code points, leaving the identical performance as a 
side-effect of the platform integration.




All in all, I don't think we should legislate Python strings to be
able to support 21-bit code points using O(1) indexing. PEP 393 makes
this possible for CPython, and it's been said that PyPy can follow
suit. But it'll be a "quality-of-implementation" issue, not built into
the language spec.


Makes sense to me. Most likely, Unicode heavy Python code will have to take 
platform specifics into account anyway, so there are limits as to what is 
suitable for a language spec.


Stefan



Re: [Python-Dev] Windows installers and %PATH%

2011-08-26 Thread Brian Curtin
On Fri, Aug 26, 2011 at 12:18, Andrew Pennebaker <
andrew.penneba...@gmail.com> wrote:

> Also, there's no need to "buy in" to the Windows toolchain just to edit
> PATH. Installer software includes functionality for editing environment
> variables, and in any case Python has built in environment variable editing,
> even for Windows.
>

The built-in environment variable support, e.g., os.getenv/putenv/environ,
isn't helpful here as it does not modify the global environment. It modifies
the current process and usually subprocesses. The proper way to apply
environment variable changes to the entire system is via the registry and
broadcasting a setting change message.
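To illustrate the point, a minimal sketch (the variable name is made up): a change made through os.environ is inherited by child processes but never leaves the process tree, let alone reaches the registry.

```python
import os
import subprocess
import sys

# Changing os.environ affects only this process and its children; it
# never touches the machine-wide environment (on Windows that would
# require writing the registry and broadcasting a settings-change
# message, as described above).
os.environ["DEMO_VAR"] = "hello"    # DEMO_VAR is a made-up name

# A child process sees the modified value...
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_VAR'])"],
    capture_output=True, text=True,
)
print(child.stdout.strip())   # hello
```

...but a shell opened outside this process tree would not.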


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 10:13 AM, Paul Moore  wrote:
> On 26 August 2011 18:02, Guido van Rossum  wrote:
>
>> Eek. No, please. Those platforms' native string types have length and
>> slicing operations that are O(1) and work in terms of 16-bit code
>> points. Python should use those. It would be awful if Java and Python
>> code doing the same manipulations on the same string would come to
>> different conclusions because Python tried to paper over surrogates.
>
> *That* is actually the erroneous assumption I had made - that the Java
> and .NET native string type had code point semantics (i.e., took
> surrogates into account). As that isn't the case, my comments aren't
> valid - and I agree that having common semantics (and hence exposing
> surrogates) is too important to lose.

Those platforms probably *also* have libraries of operations to
support writing apps that conform to the Unicode standard. But those
apps will have to be aware of the difference between the "naive"
length of a string and the number of code points of characters in it.

> On the other hand, that pretty much establishes that whatever PEP 393
> achieves in terms of allowing all builds of CPython to offer code
> point semantics, the language definition can't mandate it.

The most severe consequence to me seems that the stdlib (which is
reused by those other platforms) cannot assume CPython's ideal world
-- even if specific apps sometimes can.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Windows installers and %PATH%

2011-08-26 Thread Andrew Pennebaker
I mentioned PYTHONROOT\Script because of the distribute package, which adds
PYTHONROOT\Script\easy_install.exe.

My mistake if \Script is created by distribute and not Python. Then my beef
is with distribute for not adding its binaries to PATH--how else would I use
easy_install if not in a terminal?

Cheers,

Andrew Pennebaker
www.yellosoft.us

On Fri, Aug 26, 2011 at 9:40 AM, Brian Curtin wrote:

> On Thu, Aug 25, 2011 at 23:04, Andrew Pennebaker <
> andrew.penneba...@gmail.com> wrote:
>
>> Please have the Windows installers add the Python installation directory
>> to the PATH environment variable.
>
>
> The http://bugs.python.org bug tracker is a better place for feature
> requests like this, of which there have been several over the years. This
> has become a hotter topic lately with several discussions around the
> community, and a PEP to provide some similar functionality. I've talked with
> several educators/trainers around and the lack of a Path installation is the
> #1 thing that bites their newcomers, and it's an issue that bites them
> before they've even begun to learn.
>
> Many newbies dive in without knowing that they must manually add
>> C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that
>> should have been done by the installers way back in Python 1.0.
>>
>> Please also add PYTHONROOT\Scripts. It's where cool things like
>> easy_install.exe are stored. More yak shaving.
>>
>
> A clean installation of Python includes no Scripts directory, so I'm not
> sure we should be polluting the Path with yet-to-exist directories. An
> approach could be to have packaging optionally add the scripts directory on
> the installation of a third-party package.
>
> The only potential downside to this is upsetting users who manage multiple
>> python installations. It's not a problem: they already manually adjust PATH
>> to their liking.
>>
>
> "Users who manage multiple python installations" is probably a very, very
> large number, so we have quite the audience to appease, and it actually is a
> problem. We should not go halfway on this feature and say "if it doesn't
> work perfectly, you're back to being on your own". I think the likely case
> is that any path addition feature will read the path, then offer to replace
> existing instances or append to the end.
>
> I haven't yet done any work on this, but my todo list for 3.3 includes
> adding some path related features to the installer.
>
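The read-then-replace-or-append behaviour sketched above could look like the following hypothetical helper (not actual installer code; Windows compares path entries case-insensitively, and this version removes existing occurrences before appending):

```python
import os

def add_to_path(path_value, entry, sep=os.pathsep):
    """Return path_value with entry present exactly once.

    Existing occurrences (compared case-insensitively, as on Windows)
    are removed first, then entry is appended at the end.
    """
    parts = [p for p in path_value.split(sep) if p]
    kept = [p for p in parts if p.lower() != entry.lower()]
    return sep.join(kept + [entry])

print(add_to_path("C:\\Windows;C:\\Python27", "C:\\Python32", sep=";"))
# C:\Windows;C:\Python27;C:\Python32
```

A real installer would additionally have to write the result back to the registry rather than to a process-local variable.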


Re: [Python-Dev] Windows installers and %PATH%

2011-08-26 Thread Andrew Pennebaker
I see that the Ruby 1.9 stable Windows installer has a checkbox to add the
Ruby binaries to PATH. That would be excellent for Python.

Also, there's no need to "buy in" to the Windows toolchain just to edit
PATH. Installer software includes functionality for editing environment
variables, and in any case Python has built in environment variable editing,
even for Windows.

Cheers,

Andrew Pennebaker
www.yellosoft.us

On Fri, Aug 26, 2011 at 9:40 AM, Brian Curtin wrote:

> On Thu, Aug 25, 2011 at 23:04, Andrew Pennebaker <
> andrew.penneba...@gmail.com> wrote:
>
>> Please have the Windows installers add the Python installation directory
>> to the PATH environment variable.
>
>
> The http://bugs.python.org bug tracker is a better place for feature
> requests like this, of which there have been several over the years. This
> has become a hotter topic lately with several discussions around the
> community, and a PEP to provide some similar functionality. I've talked with
> several educators/trainers around and the lack of a Path installation is the
> #1 thing that bites their newcomers, and it's an issue that bites them
> before they've even begun to learn.
>
> Many newbies dive in without knowing that they must manually add
>> C:\PythonXY to PATH. It's yak shaving, something perfectly automatable that
>> should have been done by the installers way back in Python 1.0.
>>
>> Please also add PYTHONROOT\Scripts. It's where cool things like
>> easy_install.exe are stored. More yak shaving.
>>
>
> A clean installation of Python includes no Scripts directory, so I'm not
> sure we should be polluting the Path with yet-to-exist directories. An
> approach could be to have packaging optionally add the scripts directory on
> the installation of a third-party package.
>
> The only potential downside to this is upsetting users who manage multiple
>> python installations. It's not a problem: they already manually adjust PATH
>> to their liking.
>>
>
> "Users who manage multiple python installations" is probably a very, very
> large number, so we have quite the audience to appease, and it actually is a
> problem. We should not go halfway on this feature and say "if it doesn't
> work perfectly, you're back to being on your own". I think the likely case
> is that any path addition feature will read the path, then offer to replace
> existing instances or append to the end.
>
> I haven't yet done any work on this, but my todo list for 3.3 includes
> adding some path related features to the installer.
>


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Paul Moore
On 26 August 2011 17:51, Guido van Rossum  wrote:
> On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis"  wrote:

(Regarding my comments on code point semantics)

>> You seem to assume it is ok for Jython/IronPython to provide indexing in
>> O(n). It is not.
>
> Indeed.


On 26 August 2011 18:02, Guido van Rossum  wrote:

> Eek. No, please. Those platforms' native string types have length and
> slicing operations that are O(1) and work in terms of 16-bit code
> points. Python should use those. It would be awful if Java and Python
> code doing the same manipulations on the same string would come to
> different conclusions because Python tried to paper over surrogates.

*That* is actually the erroneous assumption I had made - that the Java
and .NET native string type had code point semantics (i.e., took
surrogates into account). As that isn't the case, my comments aren't
valid - and I agree that having common semantics (and hence exposing
surrogates) is too important to lose.

On the other hand, that pretty much establishes that whatever PEP 393
achieves in terms of allowing all builds of CPython to offer code
point semantics, the language definition can't mandate it.

Thanks for the clarification.
Paul.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel  wrote:
> "Martin v. Löwis", 26.08.2011 11:29:
>>
>> You seem to assume it is ok for Jython/IronPython to provide indexing in
>> O(n). It is not.
>
> I think we can leave this discussion aside.

(And yet, you keep arguing. :-)

> Jython and IronPython have their
> own platform specific constraints to which they need to adapt their
> implementation. For a Jython user, it means a lot to be able to efficiently
> pass strings (and other data) back and forth between Jython and other JVM
> code, and it's not hard to guess that the same is true for IronPython/.NET
> users. After all, the platform integration is the very *reason* for most
> users to select one of these implementations.

Right.

> Besides, what if these implementations provided indexing in, say, O(log N)
> instead of O(1) or O(N), e.g. by building a tree index into each string? You
> could have an index that simply marks runs of surrogate pairs and BMP
> substrings, thus providing a likely-to-be-somewhat-compact index. That index
> would obviously have to be built, but so do the different string
> representations in post-PEP-393 CPython, especially on Windows, as I have
> learned.
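(As a toy illustration of the kind of index described in the quoted paragraph, purely hypothetical and not a proposal for any implementation: record the code point position of each surrogate pair, then translate a code point index to a code unit index with a binary search over those positions.)

```python
import bisect
import struct

def utf16_units(s):
    """Decode a str into a list of UTF-16 code units."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def build_index(units):
    """Return the code point positions of surrogate pairs, in order."""
    breaks, cp, i = [], 0, 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:  # high surrogate: a pair follows
            breaks.append(cp)
            i += 2
        else:
            i += 1
        cp += 1
    return breaks

def codepoint_to_unit(breaks, p):
    """O(log n) translation of a code point index to a code unit index:
    each pair strictly before position p adds one extra code unit."""
    return p + bisect.bisect_left(breaks, p)

units = utf16_units("a\U00010400b")      # 'a', one astral char, 'b'
breaks = build_index(units)
print([codepoint_to_unit(breaks, p) for p in range(3)])   # [0, 1, 3]
```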

Eek. No, please. Those platforms' native string types have length and
slicing operations that are O(1) and work in terms of 16-bit code
points. Python should use those. It would be awful if Java and Python
code doing the same manipulations on the same string would come to
different conclusions because Python tried to paper over surrogates.

I dug up some evidence for Java, at least:

http://download.oracle.com/javase/1.5.0/docs/api/java/lang/CharSequence.html#length%28%29

"""
length

int length()

Returns the length of this character sequence. The length is the
number of 16-bit chars in the sequence.

Returns:
the number of chars in this sequence
"""

This is quite explicit about counting 16-bit code units. I've found
similar info about .NET, which defines "char" as a 16-bit quantity and
string length in terms of the number of "char" items.
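The distinction is easy to demonstrate from Python itself (on a wide or PEP-393 build, where len() counts code points):

```python
s = "a\U00010400"        # 'a' plus one astral-plane character (U+10400)

# Python on a wide/PEP-393 build: length counted in code points
print(len(s))                             # 2

# Java/.NET-style length: number of 16-bit code units
print(len(s.encode("utf-16-le")) // 2)    # 3
```

A narrow build, like Java and .NET, would report 3 for both.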

> Would such a less severe violation of the strict O(1) rule still be "not
> ok"? I think this is not such a clear black-and-white issue. Both
> implementations have notably different performance characteristics than
> CPython in some more or less important areas, as does PyPy. At some point,
> the language compliance label has to account for that.

Since you had to ask, I have to declare that, indeed, non-O(1)
behavior would not be okay for those platforms.

All in all, I don't think we should legislate Python strings to be
able to support 21-bit code points using O(1) indexing. PEP 393 makes
this possible for CPython, and it's been said that PyPy can follow
suit. But it'll be a "quality-of-implementation" issue, not built into
the language spec.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Martin v. Löwis
Am 26.08.2011 17:55, schrieb Stefan Behnel:
> Stefan Behnel, 25.08.2011 23:30:
>> Sadly, a quick look at a couple of recent commits in the pep-393 branch
>> suggested that it is not even always obvious to you as the authors which
>> macros can be called safely and which cannot. I immediately spotted a bug
>> in one of the updated core functions (unicode_repr, IIRC) where
>> PyUnicode_GET_LENGTH() is called without a previous call to
>> PyUnicode_FAST_READY().
> 
> Here is another example from unicodeobject.c, commit 56aaa17fc05e:
> 
> +switch(PyUnicode_KIND(string)) {
> +case PyUnicode_1BYTE_KIND:
> +list = ucs1lib_splitlines(
> +(PyObject*) string, PyUnicode_1BYTE_DATA(string),
> +PyUnicode_GET_LENGTH(string), keepends);
> +break;
> +case PyUnicode_2BYTE_KIND:
> +list = ucs2lib_splitlines(
> +(PyObject*) string, PyUnicode_2BYTE_DATA(string),
> +PyUnicode_GET_LENGTH(string), keepends);
> +break;
> +case PyUnicode_4BYTE_KIND:
> +list = ucs4lib_splitlines(
> +(PyObject*) string, PyUnicode_4BYTE_DATA(string),
> +PyUnicode_GET_LENGTH(string), keepends);
> +break;
> +default:
> +assert(0);
> +list = 0;
> +}
> 
> The assert(0) at the end will hit when the system is running out of
> memory while working on a wchar string.

No, that should not happen: it should never get to this point.

I agree with your observation that something should be done about error
handling, and will update the PEP shortly. I propose that
PyUnicode_Ready should be explicitly called on input where raising an
exception is feasible. In contexts where it is not feasible (such
as reading a character, or reading the length or the kind), failing to
ready the string should cause a fatal error.

What do you think?

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Guido van Rossum
On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis"  wrote:
>> IronPython and Jython can retain UTF-16 as their native form if that
>> makes interop cleaner, but in doing so they need to ensure that basic
>> operations like indexing and len work in terms of code points, not
>> code units, if they are to conform.
>
> That means that they won't conform, period. There is no efficient
> maintainable implementation strategy to achieve that property, and
> it may take well years until somebody provides an efficient
> unmaintainable implementation.
>
>> Does this make sense, or have I completely misunderstood things?
>
> You seem to assume it is ok for Jython/IronPython to provide indexing in
> O(n). It is not.

Indeed.

> However, non-conformance may not be that much of an issue. They do not
> conform in many other aspects, either (such as not supporting Python 3,
> for example, or not supporting the C API) that they may well chose to
> ignore such a minor requirement if there was one. For BMP strings,
> they conform fine, and it may well be that Jython users either don't
> have non-BMP strings, or don't care whether len() or indexing of their
> non-BMP strings is "correct".

I think this is fine. I had been hoping that all Python
implementations claiming compatibility with version 3.3 of the
language reference would be free of worries about surrogates, but it
simply doesn't make sense.

And yes, I'm well aware that PEP 393 is only for CPython. It's just
that I had hoped that it would get rid of some of Tom C's specific
complaints for all Python implementations; but it really seems
impossible to do so.

One consequence may be that the standard library, to the extent it is
shared by other implementations, may still have to worry about
surrogates and other issues inherent in narrow builds or other
16-bit-based string types. We'll cross that bridge when we get to it.
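Code that has to survive on such 16-bit-based string types typically ends up with a pairing loop of roughly this shape (a sketch only; on a narrow build the astral character below is stored as exactly this surrogate pair):

```python
def iter_codepoints(s):
    """Yield code points, combining UTF-16 surrogate pairs on the way."""
    i, n = 0, len(s)
    while i < n:
        c = ord(s[i])
        if (0xD800 <= c <= 0xDBFF and i + 1 < n
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            # Combine a high/low surrogate pair into one astral code point.
            c = 0x10000 + ((c - 0xD800) << 10) + (ord(s[i + 1]) - 0xDC00)
            i += 2
        else:
            i += 1
        yield c

# A string holding an explicit surrogate pair, as a narrow build stores it:
pair = "\ud801\udc00"          # encodes U+10400
print([hex(c) for c in iter_codepoints("a" + pair)])   # ['0x61', '0x10400']
```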

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Planned PEP status changes

2011-08-26 Thread Brett Cannon
On Tue, Aug 23, 2011 at 19:42, Nick Coghlan  wrote:
> Unless I hear any objections, I plan to adjust the current PEP
> statuses as follows some time this weekend:
>
> Move from Accepted to Finished:
>
>    389  argparse - New Command Line Parsing Module              Bethard
>    391  Dictionary-Based Configuration For Logging              Sajip
>    3108  Standard Library Reorganization                         Cannon

I had always hoped to get profile/cProfile taken care of, but
obviously that just didn't ever happen. So no objection, just a slight
sting from the reminder of why the PEP was left open.

-Brett

>    3135  New Super
> Spealman, Delaney, Ryan
>
> Move from Accepted to Withdrawn (with a reference to Reid Kleckner's blog 
> post)
>    3146  Merging Unladen Swallow into CPython
> Winter, Yasskin, Kleckner
>
>
> The PEP 3118 enhanced buffer protocol has some ongoing semantic and
> implementation issues still to be worked out, so I plan to leave that
> at Accepted. Ditto for PEP 3121 (extension module finalisation), since
> that doesn't play nicely with the current 'set everything to None'
> approach to breaking cycles during module finalisation.
>
> The other Accepted PEPs are either packaging standards related or
> genuinely not implemented yet.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


[Python-Dev] Summary of Python tracker Issues

2011-08-26 Thread Python tracker

ACTIVITY SUMMARY (2011-08-19 - 2011-08-26)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    2963 (+26)
  closed 21665 (+35)
  total  24628 (+61)

Open issues with patches: 1288 


Issues opened (44)
==

#12326: Linux 3: code should avoid using sys.platform == 'linux2'
http://bugs.python.org/issue12326  reopened by georg.brandl

#12788: test_email fails with -R
http://bugs.python.org/issue12788  opened by pitrou

#12790: doctest.testmod does not run tests in functools.partial functi
http://bugs.python.org/issue12790  opened by stevenjd

#12793: allow filters in os.walk
http://bugs.python.org/issue12793  opened by Jacek.Pliszka

#12795: Remove the major version from sys.platform
http://bugs.python.org/issue12795  opened by haypo

#12797: io.FileIO and io.open should support openat
http://bugs.python.org/issue12797  opened by pitrou

#12798: Update mimetypes documentation
http://bugs.python.org/issue12798  opened by sandro.tosi

#12800: 'tarfile.StreamError: seeking backwards is not allowed' when e
http://bugs.python.org/issue12800  opened by adunand

#12801: C realpath not used by os.path.realpath
http://bugs.python.org/issue12801  opened by pitrou

#12802: Windows error code 267 should be mapped to ENOTDIR, not EINVAL
http://bugs.python.org/issue12802  opened by pitrou

#12805: Optimizations for bytes.join() et. al
http://bugs.python.org/issue12805  opened by jcon

#12806: argparse: Hybrid help text formatter
http://bugs.python.org/issue12806  opened by GraylinKim

#12807: Optimizations for {bytearray,bytes,unicode}.strip()
http://bugs.python.org/issue12807  opened by jcon

#12808: Coverage of codecs.py
http://bugs.python.org/issue12808  opened by tleeuwenburg

#12809: Missing new setsockopts in Linux (eg: IP_TRANSPARENT)
http://bugs.python.org/issue12809  opened by micolous

#12812: libffi does not build with clang on amd64
http://bugs.python.org/issue12812  opened by shenki

#12813: uuid4 is not tested if a uuid4 system routine isn't present
http://bugs.python.org/issue12813  opened by anacrolix

#12814: Possible intermittent bug in test_array
http://bugs.python.org/issue12814  opened by ncoghlan

#12815: Coverage of smtpd.py
http://bugs.python.org/issue12815  opened by tleeuwenburg

#12816: smtpd uses library outside of the standard libraries
http://bugs.python.org/issue12816  opened by tleeuwenburg

#12817: test_multiprocessing: io.BytesIO() requires bytearray buffers
http://bugs.python.org/issue12817  opened by skrah

#12818: email.utils.formataddr incorrectly quotes parens inside quoted
http://bugs.python.org/issue12818  opened by r.david.murray

#12819: PEP 393 - Flexible Unicode String Representation
http://bugs.python.org/issue12819  opened by torsten.becker

#12820: Tests for Lib/xml/dom/minicompat.py
http://bugs.python.org/issue12820  opened by John.Chandler

#12822: NewGIL should use CLOCK_MONOTONIC if possible.
http://bugs.python.org/issue12822  opened by naoki

#12823: Broken link in "SSL wrapper for socket objects" document
http://bugs.python.org/issue12823  opened by iworm

#12825: Missing and incorrect link to a command line option.
http://bugs.python.org/issue12825  opened by Kyle.Simpson

#12828: xml.dom.minicompat is not documented
http://bugs.python.org/issue12828  opened by sandro.tosi

#12829: pyexpat segmentation fault caused by multiple calls to Parse()
http://bugs.python.org/issue12829  opened by dhgutteridge

#12830: --install-data doesn't effect resources destination
http://bugs.python.org/issue12830  opened by trevor

#12832: The documentation for the print function should explain/point 
http://bugs.python.org/issue12832  opened by r.david.murray

#12833: raw_input misbehaves when readline is imported
http://bugs.python.org/issue12833  opened by idank

#12834: memoryview.tobytes() incorrect for non-contiguous arrays
http://bugs.python.org/issue12834  opened by skrah

#12835: Missing SSLSocket.sendmsg() wrapper allows programs to send un
http://bugs.python.org/issue12835  opened by baikie

#12836: cast() creates circular reference in original object
http://bugs.python.org/issue12836  opened by bgilbert

#12837: Patch for issue #12810 removed a valid check on socket ancilla
http://bugs.python.org/issue12837  opened by baikie

#12839: zlibmodule cannot handle Z_VERSION_ERROR zlib error
http://bugs.python.org/issue12839  opened by rmtew

#12840: "maintainer" value clear the "author" value when register
http://bugs.python.org/issue12840  opened by keul

#12841: Incorrect tarfile.py extraction
http://bugs.python.org/issue12841  opened by seblu

#12842: Docs: first parameter of tp_richcompare() always has the corre
http://bugs.python.org/issue12842  opened by skrah

#12843: file object read* methods in append mode overflows
http://bugs.python.org/issue12843  opened by Otacon.Karurosu

#12844: Support more than 255 arguments
http://bug

Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-26 Thread Antoine Pitrou

Hi,

> I think that "deprecating" the use of threads w/ multiprocessing - or
> at least crippling it is the wrong answer. Multiprocessing needs the
> helper threads it uses internally to manage queues, etc. Removing that
> ability would require a near-total rewrite, which is just a
> non-starter.

I agree that this wouldn't actually benefit anyone.
Besides, I don't think it's even possible to avoid threads in
multiprocessing, given the various constraints. We would have to force
the user to run their main thread in an event loop, and that would be
twisted (tm).

> I would focus on the atfork() patch more directly, ignoring
> multiprocessing in the discussion, and focusing on the merits of gps'
> initial proposal and patch.

I think this could also be combined with Charles-François' patch.

Regards

Antoine.




Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Stefan Behnel

Stefan Behnel, 25.08.2011 23:30:

Sadly, a quick look at a couple of recent commits in the pep-393 branch
suggested that it is not even always obvious to you as the authors which
macros can be called safely and which cannot. I immediately spotted a bug
in one of the updated core functions (unicode_repr, IIRC) where
PyUnicode_GET_LENGTH() is called without a previous call to
PyUnicode_FAST_READY().


Here is another example from unicodeobject.c, commit 56aaa17fc05e:

+switch(PyUnicode_KIND(string)) {
+case PyUnicode_1BYTE_KIND:
+list = ucs1lib_splitlines(
+(PyObject*) string, PyUnicode_1BYTE_DATA(string),
+PyUnicode_GET_LENGTH(string), keepends);
+break;
+case PyUnicode_2BYTE_KIND:
+list = ucs2lib_splitlines(
+(PyObject*) string, PyUnicode_2BYTE_DATA(string),
+PyUnicode_GET_LENGTH(string), keepends);
+break;
+case PyUnicode_4BYTE_KIND:
+list = ucs4lib_splitlines(
+(PyObject*) string, PyUnicode_4BYTE_DATA(string),
+PyUnicode_GET_LENGTH(string), keepends);
+break;
+default:
+assert(0);
+list = 0;
+}

The assert(0) at the end will hit when the system is running out of memory 
while working on a wchar string.


Stefan



Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Guido van Rossum
Also, please add the table (and the reasoning that led to it) to the PEP.

On Fri, Aug 26, 2011 at 7:55 AM, Guido van Rossum  wrote:
> It would be nice if someone wrote a test to roughly verify these
> numbers, e.v. by allocating lots of strings of a certain size and
> measuring the process size before and after (being careful to adjust
> for the list or other data structure required to keep those objects
> alive).
>
> --Guido
>
> On Fri, Aug 26, 2011 at 3:29 AM, "Martin v. Löwis"  wrote:
>>> But strings are allocated via PyObject_Malloc(), i.e. the custom
>>> arena-based allocator -- isn't its overhead (for small objects) less
>>> than 2 pointers per block?
>>
>> Ah, right, I missed that. Indeed, those have no header, and the only
>> overhead is the padding to a multiple of 8.
>>
>> That shifts the picture; I hope the table below is correct,
>> assuming ASCII strings.
>> 3.2: 7 pointers (adds 4 bytes padding on 32-bit systems)
>> 393: 10 pointers
>>
>> string | 32-bit pointer | 32-bit pointer | 64-bit pointer
>> size   | 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t
>>       | 3.2     |  393 | 3.2    |  393  | 3.2    |  393  |
>> ---
>> 1      | 40      | 48   | 40     |  48   | 64     | 88    |
>> 2      | 40      | 48   | 48     |  48   | 72     | 88    |
>> 3      | 40      | 48   | 48     |  48   | 72     | 88    |
>> 4      | 48      | 48   | 56     |  48   | 80     | 88    |
>> 5      | 48      | 48   | 56     |  48   | 80     | 88    |
>> 6      | 48      | 48   | 64     |  48   | 88     | 88    |
>> 7      | 48      | 48   | 64     |  48   | 88     | 88    |
>> 8      | 56      | 56   | 72     |  56   | 96     | 96    |
>>
>> So 1-byte strings increase in size; very short strings increase
>> on 16-bit-wchar_t systems and 64-bit systems. Short strings
>> keep their size, and long strings save.
>>
>> Regards,
>> Martin
>>
>>
>>
>
>
>
> --
> --Guido van Rossum (python.org/~guido)
>



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Guido van Rossum
It would be nice if someone wrote a test to roughly verify these
numbers, e.g. by allocating lots of strings of a certain size and
measuring the process size before and after (being careful to adjust
for the list or other data structure required to keep those objects
alive).

--Guido

On Fri, Aug 26, 2011 at 3:29 AM, "Martin v. Löwis"  wrote:
>> But strings are allocated via PyObject_Malloc(), i.e. the custom
>> arena-based allocator -- isn't its overhead (for small objects) less
>> than 2 pointers per block?
>
> Ah, right, I missed that. Indeed, those have no header, and the only
> overhead is the padding to a multiple of 8.
>
> That shifts the picture; I hope the table below is correct,
> assuming ASCII strings.
> 3.2: 7 pointers (adds 4 bytes padding on 32-bit systems)
> 393: 10 pointers
>
> string | 32-bit pointer | 32-bit pointer | 64-bit pointer
> size   | 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t
>       | 3.2     |  393 | 3.2    |  393  | 3.2    |  393  |
> ---
> 1      | 40      | 48   | 40     |  48   | 64     | 88    |
> 2      | 40      | 48   | 48     |  48   | 72     | 88    |
> 3      | 40      | 48   | 48     |  48   | 72     | 88    |
> 4      | 48      | 48   | 56     |  48   | 80     | 88    |
> 5      | 48      | 48   | 56     |  48   | 80     | 88    |
> 6      | 48      | 48   | 64     |  48   | 88     | 88    |
> 7      | 48      | 48   | 64     |  48   | 88     | 88    |
> 8      | 56      | 56   | 72     |  56   | 96     | 96    |
>
> So 1-byte strings increase in size; very short strings increase
> on 16-bit-wchar_t systems and 64-bit systems. Short strings
> keep their size, and long strings save.
>
> Regards,
> Martin
>
>
>



-- 
--Guido van Rossum (python.org/~guido)
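A rough way to sanity-check such numbers from pure Python is sys.getsizeof(), which reports per-object sizes directly and so sidesteps the process-size bookkeeping Guido describes (a sketch, not a substitute for the full measurement):

```python
import sys

# Print the per-object size of short ASCII strings on this build.
# sys.getsizeof() includes the object header, so the numbers are
# directly comparable to a table like the one above for one build.
for n in range(1, 9):
    print(n, sys.getsizeof("x" * n))
```

The exact figures depend on the build (narrow/wide, 32/64-bit, pre/post PEP 393), which is precisely what the table tries to capture.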


Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-26 Thread Jesse Noller
On Fri, Aug 26, 2011 at 3:18 AM, Nir Aides  wrote:
> Another face of the discussion is about whether to deprecate the mixing of
> the threading and processing modules and what to do about the
> multiprocessing module which is implemented with worker threads.

There's a bug open - http://bugs.python.org/issue8713 which would
offer non windows users the ability to avoid using fork() entirely,
which would sidestep the problem outlined in the atfork() bug. Under
windows, which has no fork() mechanism, we create a subprocess and
then use pipes for intercommunication: nothing is inherited from the
parent process except the state passed into the child.

I think that "deprecating" the use of threads w/ multiprocessing - or
at least crippling it - is the wrong answer. Multiprocessing needs the
helper threads it uses internally to manage queues, etc. Removing that
ability would require a near-total rewrite, which is just a
non-starter.

I'd rather examine bug 8713 more closely, and offer this option for
all users in 3.x and document the existing issues outlined in
http://bugs.python.org/issue6721 for 2.x - the proposals in that bug
are IMHO, out of bounds for a 2.x release.

In essence; the issue here is multiprocessing's use of fork on unix
without the following exec - which is what the windows implementation
essentially does using subprocess.
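For reference, the option requested in issue 8713 did eventually land (in Python 3.4) as selectable "start methods"; a minimal sketch of the fork-free behavior described above, assuming a modern multiprocessing:

```python
import multiprocessing as mp

def child(conn):
    # With the "spawn" start method the child is a fresh interpreter:
    # nothing is inherited from the parent except what is sent over
    # the pipe, mirroring the Windows behavior described above.
    conn.send("hello from the child")
    conn.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # avoid plain fork() on Unix
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=child, args=(child_conn,))
    p.start()
    print(parent_conn.recv())       # "hello from the child"
    p.join()
```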

Adding the option to *not* fork changes the fundamental behavior on
unix systems - but I fundamentally feel that it's a saner, and more
consistent behavior for the module as a whole.

So, I'd ask that we not talk about tearing out the ability to use MP
and threads, or threads with MP - that would be crippling, and there's
existing code in the wild (including multiprocessing itself) that uses
this mix without issue - it's stripping out functionality for what is
a surprising and painful edge case that rarely directly affects users.

I would focus on the atfork() patch more directly, ignoring
multiprocessing in the discussion, and focusing on the merits of gps'
initial proposal and patch.

jesse


Re: [Python-Dev] Windows installers and %PATH%

2011-08-26 Thread Brian Curtin
On Thu, Aug 25, 2011 at 23:04, Andrew Pennebaker <
andrew.penneba...@gmail.com> wrote:

> Please have the Windows installers add the Python installation directory to
> the PATH environment variable.


The http://bugs.python.org bug tracker is a better place for feature
requests like this, of which there have been several over the years. This
has become a hotter topic lately with several discussions around the
community, and a PEP to provide some similar functionality. I've talked with
several educators/trainers around and the lack of a Path installation is the
#1 thing that bites their newcomers, and it's an issue that bites them
before they've even begun to learn.

Many newbies dive in without knowing that they must manually add C:\PythonXY
> to PATH. It's yak shaving, something perfectly automatable that should have
> been done by the installers way back in Python 1.0.
>
> Please also add PYTHONROOT\Scripts. It's where cool things like
> easy_install.exe are stored. More yak shaving.
>

A clean installation of Python includes no Scripts directory, so I'm not
sure we should be polluting the Path with yet-to-exist directories. An
approach could be to have packaging optionally add the scripts directory on
the installation of a third-party package.

The only potential downside to this is upsetting users who manage multiple
> python installations. It's not a problem: they already manually adjust PATH
> to their liking.
>

"Users who manage multiple python installations" is probably a very, very
large number, so we have quite the audience to appease, and it actually is a
problem. We should not go halfway on this feature and say "if it doesn't
work perfectly, you're back to being on your own". I think the likely case
is that any path addition feature will read the path, then offer to replace
existing instances or append to the end.

I haven't yet done any work on this, but my todo list for 3.3 includes
adding some path related features to the installer.


Re: [Python-Dev] Windows installers and %PATH%

2011-08-26 Thread Antoine Pitrou
On Fri, 26 Aug 2011 14:52:07 +1000
Nick Coghlan  wrote:
> Windows is a developer hostile platform unless you completely buy into
> the Microsoft toolchain, which is not an option for cross-platform
> projects like Python.

We already buy into the MS toolchain since we require Visual Studio (or
at least the command-line tools for building, but I suppose anyone doing
serious development on Windows would use the GUI). We also maintain the
project files by hand instead of using e.g. cmake.

> It's well within Microsoft's capabilities to create and support a
> POSIX compatibility layer that allows applications to look and feel
> like native ones

I have a hard time imagining how a POSIX compatibility layer would
make Windows apps feel more "native".
It's a matter of fact that Unix and Windows systems function
differently. I don't know how much of it can be completely hidden.

> the multibillion dollar corporation deliberately
> failing to implement a widely recognised OS interoperability
> standard

I wouldn't call POSIX an OS interoperability standard, but a Unix
interoperability standard. It exists because there is so much
fragmentation in the Unix world. I doubt MS was invited to the party
when POSIX specifications were designed.

Windows has its own standards, but since MS is basically the sole OS
vendor, they are free to dictate them :)

And when I look at the various "POSIX" systems we try to support there:
http://www.python.org/dev/buildbot/all/waterfall?category=3.x.stable&category=3.x.unstable
I have the feeling that perhaps we spend more time trying to work around
incompatibilities, special cases and various levels of (in)compliance
among POSIX systems, than implementing the Windows-specific code paths
of low-level functions (where the APIs are usually well-defined and
very stable).

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Stefan Behnel

Antoine Pitrou, 26.08.2011 12:51:

Why would PEP 393 apply to other implementations than CPython?


Not the PEP itself, just the implications of the result.

The question was whether the language specification, in a post-PEP-393 
world, can (and if so, should) be changed to require unicode objects to be 
defined based on code points. Narrow builds, as well as Jython and IronPython, 
currently deviate from this as they use UTF-16 as their native string 
encoding, which, for one, prevents O(1) indexing into characters as well as 
a direct match between length and character count (minus combining 
characters etc.).


I think this discussion can safely be considered off-topic for this thread 
(which isn't exactly short enough to keep adding more topics to it).


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Antoine Pitrou

Why would PEP 393 apply to other implementations than CPython?

Regards

Antoine.



On Fri, 26 Aug 2011 00:01:42 +
Dino Viehland  wrote:
> Guido wrote:
> > Which reminds me. The PEP does not say what other Python
> > implementations besides CPython should do. presumably Jython and
> > IronPython will continue to use UTF-16, so presumably the language
> > reference will still have to document that strings contain code units (not 
> > code
> > points) and the objections Tom Christiansen raised against this will remain
> > true for those versions of Python. (I don't know about PyPy, they can
> > presumably decide when they start their Py3k
> > port.)
> > 
> > OTOH perhaps IronPython 3.3 and Jython 3.3 can use a similar approach and
> > we can lay the narrow build issues to rest? Can someone here speak for
> > them?
> 
> The biggest difficulty for IronPython here would be dealing w/ .NET interop.
> We can certainly introduce either an IronPython specific string class which
> is similar to CPython's PyUnicodeObject or we could have multiple distinct
> .NET types (IronPython.Runtime.AsciiString, System.String, and 
> IronPython.Runtime.Ucs4String) which all appear as the same type to Python. 
> 
> But when Python is calling a .NET API it's always going to return a 
> System.String 
> which is UTF-16.  If we had to check and convert all of those strings when 
> they 
> cross into Python it would be very bad for performance.  Presumably we could
> have a 4th type of "interop" string which lazily computes this but if we start
> wrapping .Net strings we could also get into object identity issues.
> 
> We could stop using System.String in IronPython all together and say when 
> working w/ .NET strings you get the .NET behavior and when working w/ Python 
> strings you get the Python behavior.  I'm not sure how weird and confusing 
> that 
> would be but conversion from an Ipy string to a .NET string could remain 
> cheap if 
> both were UTF-16, and conversions from .NET strings to Ipy strings would only 
> happen if the user did so explicitly.  
> 
> But it's a huge change - it'll almost certainly touch every single source 
> file in 
> IronPython.  I would think we'd get 3.2 done first and then think about what 
> to
> do here.
> 




Re: [Python-Dev] PEP 393 review

2011-08-26 Thread Martin v. Löwis
> But strings are allocated via PyObject_Malloc(), i.e. the custom
> arena-based allocator -- isn't its overhead (for small objects) less
> than 2 pointers per block?

Ah, right, I missed that. Indeed, those have no header, and the only
overhead is the padding to a multiple of 8.

That shifts the picture; I hope the table below is correct,
assuming ASCII strings.
3.2: 7 pointers (adds 4 bytes padding on 32-bit systems)
393: 10 pointers

string | 32-bit pointer | 32-bit pointer | 64-bit pointer
size   | 16-bit wchar_t | 32-bit wchar_t | 32-bit wchar_t
       | 3.2     |  393 | 3.2    |  393  | 3.2    |  393  |
-------+---------+------+--------+-------+--------+-------+
1      | 40      | 48   | 40     |  48   | 64     | 88    |
2      | 40      | 48   | 48     |  48   | 72     | 88    |
3      | 40      | 48   | 48     |  48   | 72     | 88    |
4      | 48      | 48   | 56     |  48   | 80     | 88    |
5      | 48      | 48   | 56     |  48   | 80     | 88    |
6      | 48      | 48   | 64     |  48   | 88     | 88    |
7      | 48      | 48   | 64     |  48   | 88     | 88    |
8      | 56      | 56   | 72     |  56   | 96     | 96    |

So 1-byte strings increase in size; very short strings increase
on 16-bit-wchar_t systems and 64-bit systems. Short strings
keep their size, and long strings save.

Regards,
Martin




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Stefan Behnel

"Martin v. Löwis", 26.08.2011 11:29:

You seem to assume it is ok for Jython/IronPython to provide indexing in
O(n). It is not.


I think we can leave this discussion aside. Jython and IronPython have 
their own platform specific constraints to which they need to adapt their 
implementation. For a Jython user, it means a lot to be able to efficiently 
pass strings (and other data) back and forth between Jython and other JVM 
code, and it's not hard to guess that the same is true for IronPython/.NET 
users. After all, the platform integration is the very *reason* for most 
users to select one of these implementations.


Besides, what if these implementations provided indexing in, say, O(log N) 
instead of O(1) or O(N), e.g. by building a tree index into each string? 
You could have an index that simply marks runs of surrogate pairs and BMP 
substrings, thus providing a likely-to-be-somewhat-compact index. That 
index would obviously have to be built, but so do the different string 
representations in post-PEP-393 CPython, especially on Windows, as I have 
learned.
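A toy illustration of such an index, assuming a UTF-16 code unit sequence: record the code point positions where surrogate pairs start, then map a code point index to its code unit offset in O(log N) with bisect (helper names are hypothetical):

```python
import bisect

def build_pair_index(units):
    """Return code point positions at which a surrogate pair starts."""
    pairs, cp, i = [], 0, 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:  # high surrogate: a pair follows
            pairs.append(cp)
            i += 2
        else:
            i += 1
        cp += 1
    return pairs

def unit_offset(pairs, cp_index):
    # Each surrogate pair before cp_index contributes one extra code unit.
    return cp_index + bisect.bisect_right(pairs, cp_index - 1)

# UTF-16 code units of 'a' + U+10000 + 'b'
units = [0x0061, 0xD800, 0xDC00, 0x0062]
pairs = build_pair_index(units)
print(unit_offset(pairs, 2))  # code point index 2 ('b') lives at unit 3
```

A real index would of course be built lazily and kept compact, but the lookup cost is the O(log N) discussed above.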


Would such a less severe violation of the strict O(1) rule still be "not 
ok"? I think this is not such a clear black-and-white issue. Both 
implementations have notably different performance characteristics than 
CPython in some more or less important areas, as does PyPy. At some point, 
the language compliance label has to account for that.


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Martin v. Löwis
> IronPython and Jython can retain UTF-16 as their native form if that
> makes interop cleaner, but in doing so they need to ensure that basic
> operations like indexing and len work in terms of code points, not
> code units, if they are to conform.

That means that they won't conform, period. There is no efficient
maintainable implementation strategy to achieve that property, and
it may take well years until somebody provides an efficient
unmaintainable implementation.

> Does this make sense, or have I completely misunderstood things?

You seem to assume it is ok for Jython/IronPython to provide indexing in
O(n). It is not.

However, non-conformance may not be that much of an issue. They do not
conform in many other aspects, either (such as not supporting Python 3,
for example, or not supporting the C API) that they may well chose to
ignore such a minor requirement if there was one. For BMP strings,
they conform fine, and it may well be that Jython users either don't
have non-BMP strings, or don't care whether len() or indexing of their
non-BMP strings is "correct".

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Ezio Melotti
On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum  wrote:

> On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland 
> wrote:
> > On Thu, 25 Aug 2011, Guido van Rossum wrote:
> >
> >> I'm not sure what should happen with UTF-8 when it (in flagrant
> >> violation of the standard, I presume) contains two separately-encoded
> >> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> >> codec does on a wide build today should be good enough.
>

Surrogates are used and valid only in UTF-16.
In UTF-8/32 they are invalid, even if they are in a pair (see
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ).  Of course Python
can/should be able to represent them internally regardless of the build
type.
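A quick Python 3 illustration of the difference: a supplementary code point is one code unit in UTF-32 but a surrogate pair (two code units) in UTF-16:

```python
s = "\U00010400"  # DESERET CAPITAL LETTER LONG I, a supplementary character

print(len(s))                       # 1 code point (wide / PEP 393 build)
print(s.encode("utf-16-be").hex())  # 'd801dc00': two code units, a surrogate pair
print(s.encode("utf-32-be").hex())  # '00010400': one code unit, no surrogates
```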

>>Similarly for
> >> encoding to UTF-8 on a wide build if one managed to create a string
> >> containing a surrogate pair. Basically, I'm for a
> >> garbage-in-garbage-out approach (with separate library functions to
> >> detect garbage if the app is worried about it).
> >
> > If it's called UTF-8, there is no decision to be taken as to decoder
> > behaviour - any byte sequence not permitted by the Unicode standard must
> > result in an error (although, of course, *how* the error is to be
> reported
> > could legitimately be the subject of endless discussion).
>

What do you mean?  We use the "strict" error handler by default and we can
specify other handlers already.


>  There are
> > security implications to violating the standard so this isn't just
> > legalistic purity.
>
> You have a point. The security issues cannot be seen separate from all
> the other issues. The folks inside Google who care about Unicode often
> harp on this. So I stand corrected. I am fine with codecs treating
> code points or code point sequences that the Unicode standard doesn't
> like (e.g. lone surrogates) the same way as more severe errors in the
> encoded bytes (lots of byte sequences already aren't valid UTF-8).


Codecs that use the official names should stick to the standards.  For
example s.encode('utf-32') should either produce a valid utf-32 byte string
or raise an error if 's' contains invalid characters (e.g. surrogates).
We can have other internal codecs that are based on the UTF-* encodings but
allow the representation of lone surrogates and even expose them if we want,
but they should have a different name (even 'utf-*-something' should be ok,
see http://bugs.python.org/issue12729#msg142053 from "Unicode says you can't
put surrogates or noncharacters in a UTF-anything stream.").


> I
> just hope this doesn't require normal forms or other expensive
> operations; I hope it's limited to rejecting invalid use of surrogates
> or other values that are not valid code points (e.g. 0, or >= 2**21).
>

I think there shouldn't be any normalization done automatically by the
codecs.


>
> > Hmmm, doesn't look good:
> >
> > Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> '\xed\xb0\x80'.decode ('utf-8')
> > u'\udc00'
> > >>>
> > Incorrect!  Although this is a narrow build - I can't say what the wide
> > build would do.
>

The UTF-8 codec used to follow RFC 2279 and only recently has been updated
to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ).  On Python
2.x it still produces invalid UTF-8 because changing it is backward
incompatible.  In Python 2 UTF-8 can be used to encode every codepoint from
0 to 0x10FFFF, and it always works.  If we change it now it might start
raising errors for an operation that never raised them before (see
http://bugs.python.org/issue12729#msg142047 ).
Luckily this is fixed in Python 3.x.
I think there are more codepoints/byte sequences that should be rejected
while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't
looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2
(if applicable), so if you find mismatches with the Unicode standard and
report an issue, feel free to assign it to me).

Best Regards,
Ezio Melotti


>
> > For reasons of practicality, it may be appropriate to provide easy access
> to
> > a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not
> be
> > called UTF-8.  Other variations may also find use if provided.
> >
> > See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
> >
> > And CESU-8 technical report: http://www.unicode.org/reports/tr26/
>
> Thanks for the links! I also like the term "supplemental character" (a
> code point >= 2**16). And I note that they talk about characters were
> we've just agreed that we should say code points...
>
> --
> --Guido van Rossum (python.org/~guido )
> ___
>
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread M.-A. Lemburg
Stefan Behnel wrote:
> Isaac Morland, 26.08.2011 04:28:
>> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>>> I'm not sure what should happen with UTF-8 when it (in flagrant
>>> violation of the standard, I presume) contains two separately-encoded
>>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>>> codec does on a wide build today should be good enough. Similarly for
>>> encoding to UTF-8 on a wide build if one managed to create a string
>>> containing a surrogate pair. Basically, I'm for a
>>> garbage-in-garbage-out approach (with separate library functions to
>>> detect garbage if the app is worried about it).
>>
>> If it's called UTF-8, there is no decision to be taken as to decoder
>> behaviour - any byte sequence not permitted by the Unicode standard must
>> result in an error (although, of course, *how* the error is to be
>> reported
>> could legitimately be the subject of endless discussion). There are
>> security implications to violating the standard so this isn't just
>> legalistic purity.
>>
>> Hmmm, doesn't look good:
>>
>> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
>> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> '\xed\xb0\x80'.decode ('utf-8')
>> u'\udc00'
>> >>>
>>
>> Incorrect! Although this is a narrow build - I can't say what the wide
>> build would do.
> 
> Works the same for me in a wide Py2.7 build, but gives me this in Py3:
> 
> Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
> [GCC 4.4.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> b'\xed\xb0\x80'.decode ('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
> illegal encoding
> 
> Same for current Py3.3 and the PEP393 build (although both have a better
> exception message now: "UnicodeDecodeError: 'utf8' codec can't decode
> bytes in position 0-1: invalid continuation byte").

The reason for this is that the UTF-8 codec in Python 2.x
has never rejected lone surrogates and it was used to
store Unicode literals in pyc files (using marshal)
and also by pickle for transferring Unicode strings,
so we couldn't simply reject lone surrogates, since this
would have caused compatibility problems.

That change was made in Python 3.x by having a special
error handler surrogatepass which allows the UTF-8
codec to process lone surrogates as well.
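To illustrate with today's Python 3: the default strict handler rejects a lone surrogate, while the surrogatepass error handler tunnels it through the UTF-8 codec:

```python
lone = "\udc00"  # a lone low surrogate stored in a str

try:
    lone.encode("utf-8")  # "strict" is the default error handler
except UnicodeEncodeError:
    print("strict: lone surrogate rejected")

data = lone.encode("utf-8", "surrogatepass")
print(data)                                            # b'\xed\xb0\x80'
print(data.decode("utf-8", "surrogatepass") == lone)   # True
```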

BTW: I'd love to join the discussion about PEP 393, but
unfortunately I'm swamped with work, so these are just
a few comments...

What I'm missing in the discussion is statistics of the
effects of the patch (both memory and performance) and
the effect on 3rd party extensions.

I'm not convinced that the memory/speed tradeoff is worth the
breakage or whether the patch actually saves memory in real world
applications and I'm unsure whether the needed code changes to
the binary Python Unicode API can be done in a minor Python
release.

Note that in the worst case, a PEP 393 Unicode object will
save three versions of the same string, e.g. on Windows
with sizeof(wchar_t)==2: A UCS4 version in str,
a UTF-8 version in utf8 (this gets built whenever Python needs
a UTF-8 version of the Object) and a wchar_t version in wstr
(which gets built whenever Python codecs or extensions need
Py_UNICODE or a wchar_t representation).
On all platforms, in the case where you store a Latin-1
non-ASCII string: str holds the Latin-1 string, utf8 the
UTF-8 version and wstr the 2- or 4-bytes wchar_t version.


* A note on terminology: Python stores Unicode as code points.

A Unicode "code point" refers to any value in the Unicode code
range which is 0 - 0x10FFFF. Lone surrogates, unassigned
and illegal code points are all still code points - this is
a detail people often forget. Various code points in Unicode
have special meanings and some are not allowed to be
used in encodings, but that does not rule them out from
being stored and processed as code points.

Code units are only used in encoded versions of Unicode, e.g.
UTF-8, -16, -32. Mixing code units and code points
can cause much confusion, so it's better to talk only
about code points when referring to Python Unicode objects,
since you only ever meet code units when looking at the
bytes output of the codecs.

This is important to know, since Python is not only meant
to process Unicode, but also to build Unicode strings, so
a careful distinction has to be made when considering what
is correct and what not: codecs have to follow much more
strict rules than Python itself.


* A note on surrogates: These are just one particular problem
where you run into the situation where splitting a Unicode
string potentially breaks a combination of code points.
There are a few other types of code points that cause similar
problems, e.g. combining code points.
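For example (using the stdlib unicodedata module), an accented character can be one or two code points depending on normalization, so splitting between the base and the combining mark breaks the combination:

```python
import unicodedata

s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one user-perceived character

print(len(s))  # 2 code points
composed = unicodedata.normalize("NFC", s)
print(len(composed))         # 1 (the precomposed U+00E9)
print(composed == "\u00e9")  # True
```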

Simply going with UCS-4 does not solve the problem, since
even with UCS-4 storage

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread Paul Moore
On 26 August 2011 03:52, Guido van Rossum  wrote:
> I know that by now I am repeating myself, but I think it would be
> really good if we could get rid of this ambiguity. PEP 393 seems the
> best way forward, even if it doesn't directly address what to do for
> IronPython or Jython, both of which have to deal with a pervasive
> native string type that contains UTF-16.

Hmm, I'm completely naive in this area, but from reading the thread,
would a possible approach be to say that Python (the language
definition) is defined in terms of code points (as we already do, even
if the wording might benefit from some clarification). Then, under PEP
393, and currently in wide builds, CPython conforms to that definition
(and retains the property of basic operations being O(1), which is not
in the language definition but is a user expectation and your
expressed requirement).

IronPython and Jython can retain UTF-16 as their native form if that
makes interop cleaner, but in doing so they need to ensure that basic
operations like indexing and len work in terms of code points, not
code units, if they are to conform. Presumably this will be easier
than moving to a UCS-4 representation, as they can defer to runtime
support routines via interop (which presumably get this right - or at
the very least can be blamed for any errors :-)) They lose the O(1)
guarantee, but that's easily defensible as a tradeoff to conform to
underlying runtime semantics.

Does this make sense, or have I completely misunderstood things?

Paul.

PS Thanks to all for the discussion in general, I'm learning a lot
about Unicode from all of this!


Re: [Python-Dev] issue 6721 "Locks in python standard library should be sanitized on fork"

2011-08-26 Thread Nir Aides
Another face of the discussion is about whether to deprecate the mixing of
the threading and processing modules and what to do about the
multiprocessing module which is implemented with worker threads.



On Tue, Aug 23, 2011 at 11:29 PM, Antoine Pitrou wrote:

> Le mardi 23 août 2011 à 22:07 +0200, Charles-François Natali a écrit :
> > 2011/8/23 Antoine Pitrou :
> > > Well, I would consider the I/O locks the most glaring problem. Right
> > > now, your program can freeze if you happen to do a fork() while e.g.
> > > the stderr lock is taken by another thread (which is quite common when
> > > debugging).
> >
> > Indeed.
> > To solve this, a similar mechanism could be used: after fork(), in the
> > child process:
> > - just reset each I/O lock (destroy/re-create the lock) if we can
> > guarantee that the file object is in a consistent state (i.e. that all
> > the invariants hold). That's the approach I used in my initial patch.
>
> For I/O locks I think that would work.
> There could also be a process-wide "fork lock" to serialize locks and
> other operations, if we want 100% guaranteed consistency of I/O objects
> across forks.
>
> > - call a fileobject method which resets the I/O lock and sets the file
> > object to a consistent state (in other word, an atfork handler)
>
> I fear that the complication with atfork handlers is that you have to
> manage their lifecycle as well (i.e., when an IO object is destroyed,
> you have to unregister the handler).
>
> Regards
>
> Antoine.
>
>