date:20100622

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi

On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
> 
>  > One comment here -- you can also have uri's that aren't decodable
>  > into their true textual meaning using a single encoding.
>  > 
>  > Apache will happily serve out uris that have utf-8, shift-jis, and
>  > euc-jp components inside of their path but the textual
>  > representation that was intended will be garbled (or be represented
>  > by escaped byte sequences).  For that matter, apache will serve
>  > requests that have no true textual representation as it is working
>  > on the byte level rather than the character level.
> 
> Sure.  I've never seen that combination, but I have seen Shift JIS and
> KOI8-R in the same path.
> 
> But in that case, just using 'latin-1' as the encoding allows you to
> use the (unicode) string operations internally, and then spew your
> mess out into the world for someone else to clean up, just as using
> bytes would.
> 
This is true.  I'm giving this as a real-world counterexample to the
assertion that URIs are "text".  In fact, I think you're confusing things
a little by asserting that the RFC says that URIs are text.  I'll address
that two sections down.

>  > So a complete solution really should allow the programmer to pass
>  > in uris as bytes when the programmer knows that they need it.
> 
> Other than passing bytes into a constructor, I would argue if a
> complete solution requires, eg, an interface that allows
> urljoin(base,subdir) where the types of base and subdir are not
> required to match, then it doesn't belong in the stdlib.  For stdlib
> usage, that's premature optimization IMO.
> 
I'll definitely buy that.  Would urljoin(b_base, b_subdir) => bytes and
urljoin(u_base, u_subdir) => unicode be acceptable though?  (I think, given
other options, I'd rather see two separate functions, though.  It seems more
discoverable and less prone to taking bad input some of the time to have two
functions that clearly only take one type of data apiece.)

> The RFC says that URIs are text, and therefore they can (and IMO
> should) be operated on as text in the stdlib.

If I'm reading the RFC correctly, you're actually operating on two different
levels here.  Here's the section 2 that you quoted earlier, now in its
entirety::

2.  Characters

   The URI syntax provides a method of encoding data, presumably for the
   sake of identifying a resource, as a sequence of characters.  The URI
   characters are, in turn, frequently encoded as octets for transport or
   presentation.  This specification does not mandate any particular
   character encoding for mapping between URI characters and the octets used
   to store or transmit those characters.  When a URI appears in a protocol
   element, the character encoding is defined by that protocol; without such
   a definition, a URI is assumed to be in the same character encoding as
   the surrounding text.

   The ABNF notation defines its terminal values to be non-negative integers
   (codepoints) based on the US-ASCII coded character set [ASCII].  Because
   a URI is a sequence of characters, we must invert that relation in order
   to understand the URI syntax.  Therefore, the integer values used by the
   ABNF must be mapped back to their corresponding characters via US-ASCII
   in order to complete the syntax rules.

   A URI is composed from a limited set of characters consisting of digits,
   letters, and a few graphic symbols.  A reserved subset of those
   characters may be used to delimit syntax components within a URI while
   the remaining characters, including both the unreserved set and those
   reserved characters not acting as delimiters, define each component's
   identifying data.

So here's some data that matches those terms up to actual steps in the
process::

  # We start off with some arbitrary data that defines a resource.  This is
  # not necessarily text.  It's the data from the first sentence:
  data = b"\xff\xf0\xef\xe0"

  # We encode that into text and combine it with the scheme and host to form
  # a complete uri.  This is the "URI characters" mentioned in section #2.
  # It's also the "sequence of characters" mentioned in 1.1, as it is not
  # until this point that we actually have a URI.
  uri = b"http://host/" + percentencoded(data)
  # 
  # Note1: percentencoded() needs to take any bytes or characters outside of
  # the characters listed in section 2.3 (ALPHA / DIGIT / "-" / "." / "_"
  # / "~") and percent encode them.  The URI can only consist of characters
  # from this set and the reserved character set (2.2).
  #
  # Note2: in this simplistic example, we're only dealing with one piece of
  # data.  With multiple pieces, we'd need to combine them with separators,
  # for instance like this:
  # uri = b'http://host/' + percentencoded(data1) + b'/'
  # + percentencoded(data2)
  #
  # Note3: at this point, the uri could be stored as unicode or bytes in
  # python3.  It doesn't matter.  It
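
A minimal sketch of the percentencoded() helper that the example above
assumes.  It leans on urllib.parse.quote_from_bytes from the Python 3
standard library; returning bytes (so that the result can be concatenated
with b"http://host/") is an assumption made for this illustration, not
something the thread specifies.

```python
from urllib.parse import quote_from_bytes


def percentencoded(data: bytes) -> bytes:
    """Percent-encode every byte outside the unreserved set of section 2.3."""
    # safe="" means "/" is escaped too.  quote_from_bytes leaves ALPHA,
    # DIGIT, "-", "." and "_" alone ("~" as well on Python 3.7 and later).
    return quote_from_bytes(data, safe="").encode("ascii")


# Arbitrary, non-textual data identifying the resource.
data = b"\xff\xf0\xef\xe0"
uri = b"http://host/" + percentencoded(data)
print(uri)
```

Once encoded this way the URI is pure ASCII, so it can be held as bytes or
decoded to str without losing anything.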

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz

On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote:

> The RFC says that URIs are text, and therefore they can (and IMO
> should) be operated on as text in the stdlib.

No, *blue* is the best color for a shed.

Oops, wait, let me try that again.

While I broadly agree with this statement, it is really an oversimplification.  
A URI is a structured object, with many different parts, which are transformed 
from bytes to ASCII (or something latin1-ish, which is really just bytes with a 
nice face on them) to real, honest-to-goodness text via the IRI specification: 
.

> Note also that the "complete solution" argument cuts both ways.  Eg, a
> "complete" solution should implement UTS 39 "confusables detection"[1]
> and IDNA[2].  Good luck doing that with bytes!

And good luck doing that with just characters, too.  You need a parsed 
representation of the URI that you can encode different parts of in different 
ways.  (My understanding is that you should only really implement confusables 
detection in the netloc... while that may be a bogus example, you're certainly 
only supposed to do IDNA in the netloc!)

You can just call urlsplit() all over the place to emulate this, but this does 
not give you the ability to go back to the original bytes, and thereby preserve 
things like brokenly-encoded segments, which seems to be what a lot of this 
hand-wringing is about.

To put it another way, there is no possible information-preserving string or 
bytes type that will make everyone happy as a result from urljoin().  The only 
return-type that gives you *everything* is "URI".

> just using 'latin-1' as the encoding allows you to
> use the (unicode) string operations internally, and then spew your
> mess out into the world for someone else to clean up, just as using
> bytes would.

This is the limitation that everyone seems to keep dancing around.  If you are 
using the stdlib, with functions that operate on sequences like 'str' or 
'bytes', you need to choose from one of three options:

  1. "decode" everything to latin1 (although I prefer to call it "charmap"
     when used in this way) so that you can have some mojibake that will
     fool a function that needs a unicode object, but not lose any
     information about your input, so that it can be transformed back into
     exact bytes (and be very careful to never pass it somewhere that it
     will interact with real text!),
  2. actually decode things using an appropriate encoding, to be displayed
     to the user and manipulated with proper text-manipulation tools, and
     throw away information about the bytes,
  3. keep both the bytes and the characters together (perhaps in a data
     structure) so that you can both display the data and encode it in
     situationally-appropriate ways.
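
The three options can be sketched roughly as follows.  The sample bytes
and the Segment type are hypothetical, chosen only to illustrate the
trade-offs; nothing here is an existing stdlib API beyond bytes.decode and
str.encode.

```python
from typing import NamedTuple

# Hypothetical path bytes that are not valid UTF-8.
raw = b"caf\xe9/\x93\xfa"

# Option 1: "charmap" decoding.  Lossless, but the result is mojibake and
# must never be mixed with real text.
masked = raw.decode("latin-1")
assert masked.encode("latin-1") == raw

# Option 2: a real decode for display.  Undecodable bytes are replaced
# with U+FFFD, so the original bytes cannot be recovered from the result.
shown = raw.decode("utf-8", errors="replace")
assert shown.encode("utf-8") != raw


# Option 3: carry both forms around together.
class Segment(NamedTuple):
    raw: bytes   # exactly what goes on the wire
    text: str    # best-effort rendering for humans


seg = Segment(raw=raw, text=shown)
```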

The stdlib as it is today is not going to handle the 3rd case for anyone.  I 
think that's fine; it is not the stdlib's job to solve everyone's problems.  
I've been happy with it providing correctly-functioning pieces that can be used 
to build more elaborate solutions.  This is what I meant when I said I agree 
with Stephen's first point: the stdlib *should* just keep operating entirely on 
strings, because URIs are defined, by the spec, to be sequences of ASCII 
characters.  But that's not the whole story.

PJE's "bstr" and "ebytes" proposals set my teeth on edge.  I can totally 
understand the motivation for them, but I think it would be a big step 
backwards for python 3 to succumb to that temptation, even in the form of a 
third-party library.  It is really trying to cram more information into a pile 
of bytes than truly exists there.  (Also, if we're going to have encodings 
attached to bytes objects, I would very much like to add "JPEG" and "FLAC" to 
the list of possibilities.)

The real tension there is that WSGI is desperately trying to avoid defining any 
data structures (i.e. classes), while still trying to work with structured 
data.  A URI class with a 'child' method could handily solve this problem.  
You could happily call IRI(...).join(some bytes).join(some text) and then just 
say "give me some bytes, it's time to put this on the network", or "give me 
some characters, I have to show something to the user", or even "give me some 
characters appropriate for an 'href=' target in some HTML I'm generating" - 
although that last one could be left to the HTML generator, provided it could 
get enough information from the URI/IRI object's various parts itself.
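
A rough sketch of what such a class might look like.  The URI class, its
child(), to_bytes() and to_text() methods, and the choice of UTF-8 for
text segments are all assumptions made for illustration; this is not an
existing stdlib or WSGI API, and netloc handling (IDNA) is left out
entirely.

```python
from urllib.parse import quote_from_bytes


class URI:
    """Hypothetical structured URI that keeps path segments as raw bytes."""

    def __init__(self, scheme, netloc, segments=()):
        self.scheme = scheme          # assumed to be ASCII str
        self.netloc = netloc          # assumed to be ASCII str
        self.segments = tuple(segments)

    def child(self, segment):
        # Accept bytes as-is; encode text as UTF-8 (an assumption).
        if isinstance(segment, str):
            segment = segment.encode("utf-8")
        return URI(self.scheme, self.netloc, self.segments + (segment,))

    def to_bytes(self):
        # "Give me some bytes, it's time to put this on the network."
        path = "/".join(quote_from_bytes(s, safe="") for s in self.segments)
        return "{}://{}/{}".format(self.scheme, self.netloc, path).encode("ascii")

    def to_text(self):
        # "Give me some characters, I have to show something to the user."
        path = "/".join(s.decode("utf-8", errors="replace") for s in self.segments)
        return "{}://{}/{}".format(self.scheme, self.netloc, path)


u = URI("http", "host").child(b"\x93\xfa").child("café")
print(u.to_bytes())
print(u.to_text())
```

Because the segments stay as bytes internally, a brokenly-encoded segment
survives the round trip to the wire even though its display form is lossy.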

I don't mean to pick on WSGI, either.  This is a common pain-point for porting 
software to 3.x - you had a string, it kinda worked most of the time before, 
but now you need to keep track of text too and the functions which seemed to 
work on bytes no longer do.

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Stephen J. Turnbull

Michael Urman writes:

 > It is somewhat troublesome that there doesn't appear to be an obvious
 > built-in idempotent-when-possible function that gives back the
 > provided bytes/str,

If you want something idempotent, it's already the case that
bytes(b'abc') => b'abc'.  What might be desirable is to make
bytes('abc') work and return b'abc', but only if 'abc' is pure ASCII
(or maybe ISO 8859/1).

Unfortunately, str(b'abc') already does work, but

st...@uwakimon ~ $ python3.1
Python 3.1.2 (release31-maint, May 12 2010, 20:15:06) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str(b'abc')
"b'abc'"
>>> 

Oops.  You can see why that probably "should" be the case.

[Python-Dev] UserDict in 2.7

2010-06-22 Thread Raymond Hettinger

There's an entry in whatsnew for 2.7 to the effect of "The UserDict class is 
now a new-style class".

I had thought there was a conscious decision to not change any existing classes 
from old-style to new-style.  IIRC, Martin had championed this idea and had 
rejected all proposals to make existing classes inherit from object.


Raymond


Re: [Python-Dev] red buildbots on 2.7

2010-06-22 Thread Ronald Oussoren


On 21 Jun, 2010, at 22:25, Antoine Pitrou wrote:

> Le lundi 21 juin 2010 à 21:13 +0100, Michael Foord a écrit :
>> 
>> If OS X is a supported and important platform for Python then fixing all 
>> problems that it reveals (or being willing to) should definitely not be 
>> a pre-requisite of providing a buildbot (which is already a service to 
>> the Python developer community). Fixing bugs / failures revealed by 
>> Bill's buildbot is not fixing them "for Bill" it is fixing them for Python.
> 
> I didn't say it was a prerequisite. I was merely pointing out that when
> platform-specific bugs appear, people using the specific platform should
> be helping if they want to actually encourage the fixing of these bugs.
> 
> OS X is only "a supported and important platform" if we have dedicated
> core developers diagnosing or even fixing issues for it (like we
> obviously have for Windows and Linux). Otherwise, I don't think we have
> any moral obligation to support it.

I look into and fix OSX issues, but do so in my spare time. This means it can
take a while until I get around to doing so.

Ronald

P.S. Please file bugs for issues on OSX and set the component to Macintosh
instead of discussing them on python-dev. I don't read python-dev on a daily
basis and almost missed this thread.


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Raymond Hettinger


On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote:
>   This is a common pain-point for porting software to 3.x - you had a string, 
> it kinda worked most of the time before, but now you need to keep track of 
> text too and the functions which seemed to work on bytes no longer do.

Thanks Glyph.  That is a nice summary of one kind of challenge facing 
programmers.


Raymond


Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Stephen J. Turnbull

P.J. Eby writes:

 > I know, it's a hard thing to wrap one's head around, since on the
 > surface it sounds like unicode is the programmer's savior.

I don't need to wrap my head around it.  It's been deeply embedded,
point first, and the nasty barbs ensure that I have no desire to pull
it back out.

To wit, I've been dealing with Japanese encoding issues on a daily
basis for 20 years, and I'm well aware that programmers have several
good reasons (and a lot more bad ones) for avoiding them, and even for
avoiding Unicode when they must deal with encodings at all.  I don't
think any of the good reasons have been offered here yet, that's all.

 > Unfortunately, real-world text data exists which cannot be safely
 > roundtripped to unicode, and must be handled in "bytes with
 > encoding" form for certain operations.

Or "Unicode with encoding" form.  See below for why this makes sense in
the context of Python.

 > I personally do not have to deal with this *particular* use case any 
 > more -- I haven't been at NTT/Verio for six years now.

As mentioned, I have a bit of understanding of the specific problems
of Japanese-language computing.  In particular, roundtripping Japanese
from *any* encoding to *any other* encoding is problematic, because
the national standards provide a proper subset of the repertoire
actually used by the Japanese people.  (Even JIS X 0213.)

 > My current needs are simpler, thank goodness.  ;-)  However, they 
 > *do* involve situations where I'm dealing with *other* 
 > encoding-restricted legacy systems, such as software for interfacing 
 > with the US Postal Service that only works with a restricted subset 
 > of latin1, while receiving mangled ASCII from an ecommerce provider, 
 > and storing things in what's effectively a latin-1 database.

Yes, I know of similar issues in other applications.  For example, TeX
error messages do not respect UTF-8 character boundaries, so Emacs has
to handle them specially (basically a mechanism similar in spirit to
PEP 383 is used).

 > Being able to easily assert what kind of bytes I've got would
 > actually let me catch errors sooner, *if* those assertions were
 > being checked when different kinds of strings or bytes were being
 > combined (i.e., at coercion time).

I see that this would make life a little easier for you in maintaining
without refactoring.  I'd say it's a kludge, but without a full list
of requirements I'm in no position to claim any authority.  Eg,
for a non-kludgey suggestion, how about defining a codec which takes
Latin-1 bytes, checks (with error on failure) for the restricted
subset, and converts to str?  Then you can manipulate these things as
str with abandon internally.  Finally you get another check in the
outgoing codec which converts from str to "effective Latin-1 bytes",
however that is defined.
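
A minimal sketch of that suggestion, written as a pair of plain functions
rather than a registered codec.  The ALLOWED set is purely hypothetical
(printable ASCII plus three accented letters); the real restricted subset
would have to come from the legacy system's documentation.

```python
# Hypothetical restricted subset: printable ASCII plus a few Latin-1 letters.
ALLOWED = frozenset(range(0x20, 0x7F)) | frozenset("éñü".encode("latin-1"))


def decode_restricted(data: bytes) -> str:
    """Latin-1 bytes -> str, with an error for bytes outside the subset."""
    for offset, byte in enumerate(data):
        if byte not in ALLOWED:
            raise ValueError(
                "byte 0x%02x at offset %d is outside the restricted subset"
                % (byte, offset)
            )
    return data.decode("latin-1")


def encode_restricted(text: str) -> bytes:
    """str -> "effective Latin-1 bytes", applying the same check on the way out."""
    data = text.encode("latin-1")   # raises UnicodeEncodeError above U+00FF
    decode_restricted(data)         # reuse the subset check
    return data


name = decode_restricted(b"caf\xe9")
print(name.upper())                 # ordinary str operations in between
print(encode_restricted(name))
```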

But OK, maybe I'm just being naive.  You need this unlovely artifice
so you can put in asserts in appropriate places.  Now, does it belong
in the stdlib?

It seems to me that in the case of Japanese roundtripping, *most* of
the time encoding back to a standard Japanese encoding will work.  If
you run into one of the problematic characters that JIS doesn't allow
but Japanese like to use because they prefer the glyph to the
JIS-standard glyph, you get an occasional error on encoding to a
standard Japanese encoding, which you handle specially with a database
of such characters.  Knowing the specific encoding originally used
*normally does not help unless you're replying to that person and
**only** that person*, because the extended repertoires vary widely
and the only standard is Japanese.  I conclude ebytes does *no* good
here.

For the ecommerce/USPS case, well, actually you need special-purpose
encodings anyway (ISTM).  'latin-1' loses, the USPS is allergic to
some valid 'latin-1' characters.  'ascii' loses, apparently you need
some of the Latin-1 repertoire, and anyway AIUI the ecommerce provider
munges the ASCII.  So what does ebytes actually buy you here, unless
you write the codecs?  If you've got the codecs, what additional
benefit do you get from ebytes?

Note that you would *also* need to do explicit transcoding anyway if
you were dealing with Japan Post instead of the USPS, although I grant
your code is probably general enough to deal with Deutsche Telecom
(but the German equivalent of your ecommerce provider probably has its
own ways of munging Latin-1).  I conclude that there may be genuine
benefits to ebytes here, but they're probably not general enough to
put in the stdlib (or the Python language).

 > Which works if and only if your outputs are truly unicode-able.

With PEP 383, they always are, as long as you allow Unicode to be
decoded to the same garbage your bytes-based program would have
produced anyway.

 > If you work with legacy systems (e.g. those Asian email clients and
 > US postal software), you are really working with a *character set*,
 > not unicode,

I think you're missing something.  Namely, Un

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull

Glyph Lefkowitz writes:
 > On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote:

 > > Note also that the "complete solution" argument cuts both ways.  Eg, a
 > > "complete" solution should implement UTS 39 "confusables detection"[1]
 > > and IDNA[2].  Good luck doing that with bytes!
 > 
 > And good luck doing that with just characters, too.

I agree with you, sorry.  I meant to cast doubt on the idea of
complete solutions, or at least claims that completeness is an excuse
for putting it in the stdlib.

 > This is the limitation that everyone seems to keep dancing around.
 > If you are using the stdlib, with functions that operate on
 > sequences like 'str' or 'bytes', you need to choose from one of
 > three options: 

There's a *fourth* way: specially designed codecs to preserve as much
metainformation as you need, while always using the str format
internally.  This can be done for at least 100,000 separate
(character, encoding) pairs by multiplexing into private space with an
auxiliary table of encodings and equivalences.  That's probably
overkill.  In many cases, adding a simple PEP 383 mechanism (to preserve
uninterpreted bytes) might be enough though, and that's pretty
plausible IMO.


[Python-Dev] adding new function

2010-06-22 Thread lesni bleble

hello,

how can i simply add new functions to module after its initialization
(Py_InitModule())?  I'm missing something like
PyModule_AddCFunction().

thank you

L.

Re: [Python-Dev] adding new function

2010-06-22 Thread Daniel Fetchinson

> how can i simply add new functions to module after its initialization
> (Py_InitModule())?  I'm missing something like
> PyModule_AddCFunction().

This type of question really belongs on python-list (aka
comp.lang.python), which I have now CC'd. Please keep the discussion on that
list.

Cheers,
Daniel


-- 
Psss, psss, put it down! - http://www.cafepress.com/putitdown

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Nick Coghlan

On Tue, Jun 22, 2010 at 4:49 PM, Stephen J. Turnbull  wrote:
>  > Which works if and only if your outputs are truly unicode-able.
>
> With PEP 383, they always are, as long as you allow Unicode to be
> decoded to the same garbage your bytes-based program would have
> produced anyway.

Could it be that part of the problem here is that we need to better
advertise "errors='surrogateescape'" as a mechanism for decoding
incorrectly encoded data according to a nominal codec without throwing
UnicodeDecode and UnicodeEncode errors all over the place? Currently
it only garners a mention in the docs in the context of the os module,
the list of error handlers in the codecs module and as a default error
handler argument in the tarfile module.
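
For illustration, a short sketch of what the handler does with data that
does not match the nominal codec.  The sample bytes are made up; the
'surrogateescape' error handler itself is the one defined by PEP 383.

```python
# Latin-1 encoded bytes, nominally (and wrongly) treated as UTF-8.
raw = b"caf\xe9"

# Decoding does not raise: the undecodable byte 0xE9 is smuggled through
# as the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9"

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Without the handler, the escaped byte is rejected on the way out.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```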

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia

[Python-Dev] [OT] glyphs [was Re: email package status in 3.X]

2010-06-22 Thread Steven D'Aprano

On Tue, 22 Jun 2010 11:46:27 am Terry Reedy wrote:
> 3. Unicode disclaims direct representation of glyphic variants
> (though again, exceptions were made for asian acceptance). For
> example, in English, mechanically printed 'a' and 'g' are different
> from manually printed 'a' and 'g'. Representing both by the same
> codepoint, in itself, loses information. One who wishes to preserve
> the distinction must instead use a font tag or perhaps a
>  tag. Similarly, older English had a significantly
> different glyph for 's', which looks more like a modern 'f'.

An unfortunate example, as the old English long-s gets its own Unicode 
codepoint.

http://en.wikipedia.org/wiki/Long_s
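
This is easy to check from Python with the standard unicodedata module (a
small illustration only):

```python
import unicodedata

long_s = "\u017f"

# The long s is a separate code point with its own name...
print(unicodedata.name(long_s))

# ...and compatibility normalization folds it back to an ordinary 's'.
print(unicodedata.normalize("NFKC", long_s))
```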


-- 
Steven D'Aprano

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull

Toshio Kuratomi writes:

 > I'll definitely buy that.  Would urljoin(b_base, b_subdir) => bytes and
 > urljoin(u_base, u_subdir) => unicode be acceptable though?

Probably.  

But it doesn't matter what I say, since Guido has defined that as
"polymorphism" and approved it in principle.

 > (I think, given other options, I'd rather see two separate
 > functions, though.

Yes.

 > If you want to deal with things like this::
 >   http://host/café

Yes.

 > At that point you are no longer dealing with the sequence of
 > characters talked about in the RFC.  You are dealing with data
 > which may or may not be text.

That's right, and I think that in most cases that is what programmers
want to be dealing with.  Let the library make sure that what goes on
the wire conforms to the RFC.  I don't want to know about it, I want
to work with the content of the URI.

 > The proliferation of encoding I agree is a thing that is ugly.
 > Although, if I'm thinking correctly, that only matters when you
 > want to allow mixing bytes and unicode, correct?

Well you need to know a fair amount about the encoding: that the
reserved bytes are used as defined in the RFC, for example.

 > For debugging, I'm either not understanding or you're wrong.  If I'm given
 > an arbitrary sequence of bytes how do I sanely store them as str internally?

If it's really arbitrary, you use either a mapping to private space or
PEP 383, and accept that it won't make sense.  But in most cases you
should be able to achieve a fair degree of sanity.

 > If I transform them using an encoding that anticipates the full range of
 > bytes I may be able to display some representation of them but it's not
 > necessarily the sanest method of display (for instance, if I know that path
 > element 1 is always going to be a utf8 encoded string and path element 2 is
 > always shift-jis encoded, and path element 3 is binary data, I could
 > construct a much saner display method than treating the whole thing as
 > latin1).

And I think in most cases you will know, although the cases where
you'll know will be because of a system-wide encoding.

 > What is your basis for asserting that URIs that aren't sanely treated as
 > text are garbage?

I don't mean we can throw them away, I mean we can't do any sensible
processing on them.  You at least need to know about the reserved
delimiters.  In the same way that Philip used 'garbage' for the
"unknown" encoding.  And in the sense of "garbage in, garbage out".

 > unicode handling redesign.  I'm stating my reading of the RFC not to defend
 > the use case Philip has, but because I think that the outlook that non-text
 > uris (before being percentencoded) are violations of the RFC

That's not what I'm saying.  What I'm trying to point out is that
manipulating a bytes object as an URI sort of presumes a lot about its
encoding as text.  Since many of the URIs we deal with are more or
less textual, why not take advantage of that?

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Stephen J. Turnbull

Nick Coghlan writes:
 > On Tue, Jun 22, 2010 at 4:49 PM, Stephen J. Turnbull wrote:
 > >  > Which works if and only if your outputs are truly unicode-able.
 > >
 > > With PEP 383, they always are, as long as you allow Unicode to be
 > > decoded to the same garbage your bytes-based program would have
 > > produced anyway.
 > 
 > Could it be that part of the problem here is that we need to better
 > advertise "errors='surrogateescape'" as a mechanism for decoding
 > incorrectly encoded data according to a nominal codec without throwing
 > UnicodeDecode and UnicodeEncode errors all over the place?

Yes, I think that would make the "use str internally to urllib"
strategy a lot more palatable.  But it still needs to be combined with
a program architecture of decode-process-encode, which might require
substantial refactoring for some existing modules.
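
A minimal sketch of the decode-process-encode shape, assuming ASCII as the
nominal codec.  The helper name add_trailing_slash is hypothetical and is
not part of urllib.

```python
def add_trailing_slash(raw_path: bytes) -> bytes:
    # Decode at the boundary; non-ASCII bytes become lone surrogates.
    text = raw_path.decode("ascii", errors="surrogateescape")

    # Process using ordinary str operations and str literals.
    if not text.endswith("/"):
        text += "/"

    # Encode at the boundary; the escaped bytes come back out unchanged.
    return text.encode("ascii", errors="surrogateescape")


print(add_trailing_slash(b"/docs/\x93\xfa"))
```

The processing step in the middle never needs to know that some of the
input was not really ASCII.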


Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Fred Drake

On Tue, Jun 22, 2010 at 2:21 AM, Raymond Hettinger wrote:
> I had thought there was a conscious decision to not change any existing
> classes from old-style to new-style.

I thought so as well.  Changing any class from old-style to new-style
risks breaking applications in obscure & mysterious ways.  (Yes, we've
been bitten by this before; it's a real problem.)


  -Fred

-- 
Fred L. Drake, Jr.
"A storm broke loose in my mind."  --Albert Einstein

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Benjamin Peterson

2010/6/22 Raymond Hettinger :
> There's an entry in whatsnew for 2.7 to the effect of "The UserDict class is
> now a new-style class".
> I had thought there was a conscious decision to not change any existing
> classes from old-style to new-style.  IIRC, Martin had championed this idea
> and had rejected all proposals to make existing classes inherit from
> object.

IIRC this was because UserDict tries to be a MutableMapping but abcs
require new style classes.



-- 
Regards,
Benjamin

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Laurens Van Houtven

On Tue, Jun 22, 2010 at 2:40 PM, Fred Drake  wrote:
> On Tue, Jun 22, 2010 at 2:21 AM, Raymond Hettinger wrote:
>> I had thought there was a conscious decision to not change any existing
>> classes from old-style to new-style.
>
> I thought so as well.  Changing any class from old-style to new-style
> risks breaking applications in obscure & mysterious ways.  (Yes, we've
> been bitten by this before; it's a real problem.)
>
>
>  -Fred

+1. I've been bitten by this more than once in some of the more
obscure old(-style) classes in twisted.python.

Laurens

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Michael Urman

On Tue, Jun 22, 2010 at 00:28, Stephen J. Turnbull  wrote:
> Michael Urman writes:
>
>  > It is somewhat troublesome that there doesn't appear to be an obvious
>  > built-in idempotent-when-possible function that gives back the
>  > provided bytes/str,
>
> If you want something idempotent, it's already the case that
> bytes(b'abc') => b'abc'.  What might be desirable is to make
> bytes('abc') work and return b'abc', but only if 'abc' is pure ASCII
> (or maybe ISO 8859/1).

By idempotent-when-possible, I mean to_bytes(str_or_bytes, encoding,
errors) that would pass an instance of bytes through, or encode an
instance of str. And of course a to_str that performs similarly,
passing str through and decoding bytes. While bytes(b'abc') will give
me b'abc', neither bytes('abc') nor bytes(b'abc', 'latin-1') get me
the b'abc' I want to see.

These are trivial functions; I just don't fully understand why the
capability isn't baked in. A one argument call is idempotent capable;
a two argument call isn't as it only converts.
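
A sketch of the pair of helpers described above.  The names to_bytes and
to_str come from this message; the 'utf-8' and 'strict' defaults are
assumptions added here for illustration.

```python
def to_bytes(value, encoding="utf-8", errors="strict"):
    """Pass bytes through unchanged; encode str."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding, errors)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def to_str(value, encoding="utf-8", errors="strict"):
    """Pass str through unchanged; decode bytes."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(to_bytes(b"abc", "latin-1"), to_bytes("abc", "latin-1"))
print(to_str("abc", "latin-1"), to_str(b"abc", "latin-1"))
```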

It's not a completely made-up requirement either. A cross-platform
piece of software may need to present to a user items that are
sometimes str and sometimes bytes - particularly filenames.

> Unfortunately, str(b'abc') already does work, but
>
> st...@uwakimon ~ $ python3.1
> Python 3.1.2 (release31-maint, May 12 2010, 20:15:06)
> [GCC 4.3.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> str(b'abc')
> "b'abc'"
> >>>
>
> Oops.  You can see why that probably "should" be the case.

Sure, and I love having this there for debugging. But this is hardly
good enough for presenting to a user once you leave ASCII.

>>> u = '日本語'
>>> sjis = bytes(u, 'shift-jis')
>>> utf8 = bytes(u, 'utf-8')
>>> str(sjis), str(utf8)
("b'\\x93\\xfa\\x96{\\x8c\\xea'",
"b'\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e'")

When I happen to know the encoding, I can reverse it much more cleanly.
>>> str(sjis, 'shift-jis'), str(utf8, 'utf-8')
('日本語', '日本語')

But I can't mix this approach with str instances without writing a
different invocation.
>>> str(u, 'argh')
TypeError: decoding str is not supported

-- 
Michael Urman

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Guido van Rossum

[Just addressing one little issue here; generally I'm just happy that
we're discussing this issue in such detail from so many points of
view.]

On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi  wrote:
>[...] Would urljoin(b_base, b_subdir) => bytes and
> urljoin(u_base, u_subdir) => unicode be acceptable though?  (I think, given
> other options, I'd rather see two separate functions, though.  It seems more
> discoverable and less prone to taking bad input some of the time to have two
> functions that clearly only take one type of data apiece.)

Hm. I'd rather see a single function (it would be "polymorphic" in my
earlier terminology). After all a large number of string method calls
(and some other utility function calls) already look the same
regardless of whether they are handling bytes or text (as long as it's
uniform). If the building blocks are all polymorphic it's easier to
create additional polymorphic functions.
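
For illustration, this is what such a polymorphic urljoin looks like
from the caller's side. The sketch assumes a Python 3 release in which
urllib.parse.urljoin accepts bytes as well as str (to my knowledge 3.2
and later); the URLs are made up.

```python
from urllib.parse import urljoin

# Same call shape for text and for bytes; the result type follows the input.
text_result = urljoin('http://host/a/b', 'c')
bytes_result = urljoin(b'http://host/a/b', b'c')

# Mixing the two types in one call is rejected rather than guessed at.
try:
    urljoin('http://host/a/b', b'c')
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```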

FWIW, there are two problems with polymorphic functions, though they
can be overcome:

(1) Literals.

If you write something like x.split('&') you are implicitly assuming x
is text. I don't see a very clean way to overcome this; you'll have to
implement some kind of type check e.g.

x.split('&') if isinstance(x, str) else x.split(b'&')

A handy helper function can be written:

  def literal_as(constant, variable):
      if isinstance(variable, str):
          return constant
      else:
          return constant.encode('utf-8')

So now you can write x.split(literal_as('&', x)).
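
A short sketch of the helper in use. literal_as is the function from the
message; split_params is a hypothetical caller added only to show that
one body can serve both str and bytes input.

```python
def literal_as(constant, variable):
    """Return `constant` as str or bytes, matching the type of `variable`."""
    if isinstance(variable, str):
        return constant
    return constant.encode('utf-8')


def split_params(query):
    """Split 'a=1&b=2' (or the bytes equivalent) into (key, value) pairs."""
    amp = literal_as('&', query)
    eq = literal_as('=', query)
    return [tuple(item.split(eq, 1)) for item in query.split(amp)]
```

The result type follows the argument: str input gives tuples of str,
bytes input gives tuples of bytes.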

(2) Data sources.

These can be functions that produce new data from non-string data
(e.g. str() applied to a non-string object) or that read it from a
named file, etc. An example is read()
vs. write(): it's easy to create a (hypothetical) polymorphic stream
object that accepts both f.write('booh') and f.write(b'booh'); but you
need some other hack to make read() return something that matches a
desired return type. I don't have a generic suggestion for a solution;
for streams in particular, the existing distinction between binary and
text streams works, of course, but there are other situations where
this doesn't generalize (I think some XML interfaces have this
awkwardness in their API for converting a tree to a string).
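
The binary/text stream distinction mentioned above can be sketched with
the in-memory streams from the io module. The roundtrip function is a
hypothetical helper: it picks the stream type from the data it is given,
which is exactly the information a bare read() does not have.

```python
import io


def roundtrip(data):
    """Write `data` to a matching in-memory stream and read it back."""
    stream = io.StringIO() if isinstance(data, str) else io.BytesIO()
    stream.write(data)
    stream.seek(0)
    return stream.read()


# A text stream refuses bytes outright instead of converting them.
try:
    io.StringIO().write(b'booh')
    text_stream_accepts_bytes = True
except TypeError:
    text_stream_accepts_bytes = False
```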

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Tres Seaver

Jesse Noller wrote:
> 
> On Jun 19, 2010, at 10:13 AM, Tres Seaver  wrote:

>>> Nothing is set in stone; if something is incredibly painful, or worse
>>> yet broken, then someone needs to file a bug, bring it to this list,
>>> or bring up a patch.
>> Or walk away.
>>
> 
> Ok. If you want.

I specifically said I *didn't* want to walk away.  I'm pointing out that
in the general case, the ordinary user who finds something incredibly
painful or broken is far more likely to walk away from the platform than
try to fix it, especially if there are available alternatives (e.g.,
Ruby, Python 2) where the pain level for that user's application is lower.

>>> I guess tutorial welcome, rather than patch welcome then ;)
>> The only folks who can write the tutorial are the ones who have  
>> already drunk the koolaid.  Note that I've been making my living with
>> Python for about twelve years now, and would *like* to use Python3, but
>> can't, yet, and therefore haven't taken the first sip.
> 
> Why can't you? Is it a bug?

It's not *a* bug, it is that I do my day to day work on very large
applications which depend on a large number of not-yet-ported libraries.
 This barrier is the negative "network effect" which is the whole point
of this thread:  there is nothing wrong with Python3 except that, to use
it, I have to stop doing the work which pays to do an
indeterminately-large amount of "hobby" work (of which I already do
quite a lot).

> Let's file it and fix it. Is it that you  
> need a dependency ported?

I need dozens of them ported, and am working on some of them in the
aforementioned "copious spare time."

> Cool - let's bring it up to the maintainers,  
> or this list, or ask the PSF to push resources into helping port.  
> Anything but nothing.

Nothing is the default:  I am already successful with Python 2, and
can't be successful with Python 3 (in the sense of delivering timely,
cost-effective solutions to my customers) until *all* those dependencies
are ported and stable there.

> If what you're saying is that python 3 is a completely unsuitable  
> platform, well, then yeah - we can all "fix" it or walk away.

I didn't say that:  I said that Python 3 is unsuitable *today* for the
work I'm doing, and that the relative wins it provides over Python 2 are
dwarfed by the effort required to do all those ports myself.

 IOW, 3.x has broken TOOOWTDI for me in some areas.  There may
 be obvious ways to do it, but, as per the Zen of Python, "that
 way may not be obvious at first unless you're Dutch".  ;-)

OT:  The Dutch smiley there doesn't actually help anything but undercut
any point to having TOOOWTDI in the list at all.

>>> What areas. We need specifics which can either be:
>>>
>>> 1> Shot down.
>>> 2> Turned into bugs, so they can be fixed
>>> 3> Documented in the core documentation.

>> That's bloody ironic in a thread which had pointed at reasons why  
>> people are not even considering Py3 for their projects:  those folks won't  
>> even find the issues due to the lack of confidence in the suitability of  
>> the platform.
> 
> What I saw was a thread about some issues in email, and cgi. We have  
> some work being done to address the issue. This will help resolve some  
> of the issues.
> 
> If there are other issues, then we should step up and either help, or  
> get out of the way. Arguing about the viability of a platform we knew
> would take a bit for adoption is silly and breeds ill will.

I'm not arguing about viability:  there are obviously users for whom
Python 3 is not only viable, but superior to Python 2.  However, I am
quite confident that many pro-Python 3 folks arguing here underestimate
the scope of the issues which have generated the (self-fulfilling) "not
yet" perception.

> It's not a turd, and it's not hopeless, in fact rumor has it NumPy  
> will be ported soon which is a major stepping stone.

Sure, for the (far from trivial) subset of the community doing numerical
work.

> The only way to counteract this meme that python 3 is horribly  
> broken is to prove that it's not, fix bugs, and move on. There's no  
> point debating relative turdiness here.

Any "turdiness" (which I am *not* arguing for) is a natural consequence
of the kinds of backward incompatibilities which were *not* ruled out
for Python 3, along with the (early, now waning) "build it and they will
come" optimism about adoption rates.

Tres.
-- 
===
Tres Seaver  +1 540-429-0999  [email protected]
Palladion Software   "Excellence by Design"http://palladion.com

Re: [Python-Dev] red buildbots on 2.7

2010-06-22 Thread Ronald Oussoren


On 22 Jun, 2010, at 3:38, Alexander Belopolsky wrote:

> On Mon, Jun 21, 2010 at 6:16 PM, "Martin v. Löwis"  wrote:
>>> The test_posix failure is a regression from 2.6 (but it only shows up on
>>> some machines - it is caused by a fairly braindead implementation of a
>>> couple of posix apis by Apple apparently).
>>> 
>>> http://bugs.python.org/issue7900
>> 
>> Ah, that one. I definitely think this should *not* block the release:
> 
> I agree that this is nowhere near being a release blocker, but I think
> it would be nice to do something about it before the final release.
> 
>> a) there is no clear solution in sight. So if we wait for it resolved,
>>   it could take months until we get a 2.7 release.
> 
> The ideal solution will have to wait until Apple gets its act together
> and fixes the problem on their end.  I would say "months" is an overly
> optimistic time estimate for that.

I'd say there is no chance at all that this will be fixed in OSX 10.6; with
some luck they'll change this in 10.7.

> However, the issue is a regression
> from prior versions.  In 2.5 getgroups would truncate the list to 16
> groups, but won't crash.  More importantly the 16 groups returned
> would be correct per-process groups and not something immune to
> setgroup changes.
> 
> I proposed a very simple fix:
> 
> http://bugs.python.org/file16326/no-darwin-ext.diff
> 
> which simply minimally reverts the change that introduced the regression.

That is one way to fix it; another, just as valid, fix is to change
posix.getgroups to be able to return more than 16 groups on OSX (see my patch
in issue7900).

Both are valid fixes; each has advantages and disadvantages.

Your proposal:
* Reverts to the behavior in 2.6
* Ensures that posix.getgroups and posix.setgroups are internally consistent

My proposal:
* Uses the newer ABI, which is more likely to be the one Apple wants you to use
* Is compatible with system tools (that is, posix.getgroups() agrees with id(1))
* Is compatible with /usr/bin/python
* Results in posix.getgroups not reflecting results of posix.setgroups

What I haven't done yet, and probably should, is to check how either 
implementation of getgroups interacts with groups in the System Preferences 
panel and with groups in managed environment (using OSX Server).

My gut feeling is that the second option (my proposal) would give more useful
semantics, but that said: I almost never write code where I need os.setgroups.

Ronald


[Python-Dev] State of json in 2.7

2010-06-22 Thread Dirkjan Ochtman

It looks like simplejson 2.1.0 and 2.1.1 have been released:

http://bob.pythonmac.org/archives/2010/03/10/simplejson-210/
http://bob.pythonmac.org/archives/2010/03/31/simplejson-211/

It looks like any changes that didn't come from the Python tree didn't
go into the Python tree, either.

I guess we can't put these changes into 2.7 anymore? How can we make
this better next time?

Cheers,

Dirkjan

Re: [Python-Dev] State of json in 2.7

2010-06-22 Thread Benjamin Peterson

2010/6/22 Dirkjan Ochtman :
> I guess we can't put these changes into 2.7 anymore? How can we make
> this better next time?

Never have externally maintained packages.



-- 
Regards,
Benjamin

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Raymond Hettinger

On Jun 22, 2010, at 5:48 AM, Benjamin Peterson wrote:

> 2010/6/22 Raymond Hettinger :
>> There's an entry in whatsnew for 2.7 to the effect of "The UserDict class is
>> now a new-style class".
>> I had thought there was a conscious decision to not change any existing
>> classes from old-style to new-style.  IIRC, Martin had championed this idea
>> and had rejected all proposals to make existing classes inherit from
>> object.
> 
> IIRC this was because UserDict tries to be a MutableMapping but abcs
> require new style classes.

ISTM, this change should be reverted to the way it was in 2.6.

The registration was already working fine:

Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
>>> import UserDict
>>> import collections
>>> collections.MutableMapping.register(UserDict.UserDict)
>>> issubclass(UserDict.UserDict, collections.MutableMapping)
True

We didn't have any problems with this registration
nor did there seem to be an issue with UserDict not 
implementing dictviews.
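
For readers who have not used ABC registration, here is a minimal
Python 3 sketch of the same mechanism. LegacyDict is a hypothetical
stand-in for a class that does not inherit from the ABC; Python 3 has no
old-style classes, so this shows only the register() call, not the
old-style/new-style question under discussion.

```python
from collections.abc import MutableMapping


class LegacyDict:
    """Hypothetical mapping-like class that does not subclass the ABC."""

    def __init__(self):
        self.data = {}


# Registration declares a "virtual subclass"; nothing is inherited.
MutableMapping.register(LegacyDict)

registered = issubclass(LegacyDict, MutableMapping)
instance_ok = isinstance(LegacyDict(), MutableMapping)
# The ABC's mixin methods (update, setdefault, ...) are not added.
has_update = hasattr(LegacyDict, 'update')
```

Registration only affects issubclass() and isinstance() checks; the
registered class still has to provide the mapping methods itself.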

Please revert this change.  UserDicts have a long history
and are used by a lot of code, so we need to avoid
unnecessary breakage.

Thank you,

Raymond


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking

On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull wrote:

> Toshio Kuratomi writes:
>
>  > I'll definitely buy that.  Would urljoin(b_base, b_subdir) => bytes and
>  > urljoin(u_base, u_subdir) => unicode be acceptable though?
>
> Probably.
>
> But it doesn't matter what I say, since Guido has defined that as
> "polymorphism" and approved it in principle.
>
>  > (I think, given other options, I'd rather see two separate
>  > functions, though.
>
> Yes.
>
>  > If you want to deal with things like this::
>  >   http://host/café 
>
> Yes.
>

Just for perspective, I don't know if I've ever wanted to deal with a URL
like that.  I know how it is supposed to work, and I know what a browser
does with that, but so many tools will clean that URL up *or* won't be able
to deal with it at all that it's not something I'll be passing around.  So
from a practical point of view this really doesn't come up, and if it did it
would be in a situation where you could easily do something ad hoc (though
there is not currently a routine to quote unsafe characters in a URL... that
would be helpful, though maybe urllib.quote(url.encode('utf8'), '%/:') would
do it).

Also while it is problematic to treat the URL-unquoted value as text
(because it has an unknown encoding, no encoding, or regularly a mixture of
encodings), the URL-quoted value is pretty easy to pass around, and
normalization (in this case to http://host/caf%C3%A9) is generally fine.
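
The quoting call suggested above, written for Python 3, where the
function lives in urllib.parse rather than urllib. This is a sketch of
that one suggestion, not a general URL normalizer; the safe set '%/:' is
the one from the message.

```python
from urllib.parse import quote

url = 'http://host/café'

# Encode to UTF-8 first, then percent-quote everything outside the safe set.
normalized = quote(url.encode('utf-8'), safe='%/:')

# Because '%' is in the safe set, an already-quoted URL passes through.
again = quote(normalized.encode('utf-8'), safe='%/:')
```

Keeping '%' safe makes the call repeatable, at the price of not escaping
a literal percent sign that was meant as data.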

While it's nice to be correct about encodings, sometimes it is impractical.
And it is far nicer to avoid the situation entirely.  That is, decoding
content you don't care about isn't just inefficient, it's complicated and
can introduce errors.  The encoding of the underlying bytes of a %-decoded
URL is largely uninteresting.  Browsers (whose behavior drives a lot of
convention) don't touch any of that encoding except lately occasionally to
*display* some data in a more friendly way.  But it's only display, and
errors just make it revert to the old encoded display.

Similarly I'd expect (from experience) that a programmer using Python
would want to take the same approach, sticking with unencoded data in
nearly all situations.

-- 
Ian Bicking  |  http://blog.ianbicking.org

Re: [Python-Dev] red buildbots on 2.7

2010-06-22 Thread Alexander Belopolsky

On Tue, Jun 22, 2010 at 12:39 PM, Ronald Oussoren
 wrote:
..
> Both are valid fixes; each has advantages and disadvantages.
>
> Your proposal:
> * Reverts to the behavior in 2.6
> * Ensures that posix.getgroups and posix.setgroups are internally consistent
>
It is also very simple, and since the posix module worked fine on OSX for
years without _DARWIN_C_SOURCE, I think this is a very low risk
change.

> My proposal:
> * Uses the newer ABI, which is more likely to be the one Apple wants you to use

I don't think so.  In getgroups(2) I see

LEGACY DESCRIPTION
 If _DARWIN_C_SOURCE is defined, getgroups() can return more than
{NGROUPS_MAX} groups.

This suggests that this is legacy behavior.  Newer applications should
use getgrouplist instead.

> * Is compatible with system tools (that is, posix.getgroups() agrees with id(1))

I have not tested this recently, but I think if you exec id from a
program after a call to setgroups(), it will return process groups,
not user groups.

> * Is compatible with /usr/bin/python

I am sure that once this issue is fixed upstream, Apple will pick it up
with the next version.

> * results in posix.getgroups not reflecting results of posix.setgroups
>

This effectively substitutes getgrouplist called on the current user
for getgroups.  In 3.x, I believe the correct action will be to
provide direct access to getgrouplist which, while not POSIX (yet?),
is widely available.
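
To make the getgroups/getgrouplist distinction concrete, here is a
hedged sketch for a Unix system. It assumes a Python 3 release that
exposes os.getgrouplist (added in 3.3, as far as I know); the two
function names are hypothetical, and the lists may or may not agree on a
given platform.

```python
import os
import pwd


def process_groups():
    """Supplementary group ids of the running process (getgroups)."""
    return sorted(os.getgroups())


def user_groups():
    """Group ids the user database lists for the current user (getgrouplist).

    Returns None when the current uid has no passwd entry.
    """
    try:
        entry = pwd.getpwuid(os.getuid())
        return sorted(os.getgrouplist(entry.pw_name, entry.pw_gid))
    except (KeyError, OSError):
        return None
```

The first reflects per-process state, including any earlier setgroups()
call; the second is a lookup by user name and ignores it.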

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Benjamin Peterson

2010/6/22 Raymond Hettinger :
>
> On Jun 22, 2010, at 5:48 AM, Benjamin Peterson wrote:
>
>> 2010/6/22 Raymond Hettinger :
>>> There's an entry in whatsnew for 2.7 to the effect of "The UserDict class is
>>> now a new-style class".
>>> I had thought there was a conscious decision to not change any existing
>>> classes from old-style to new-style.  IIRC, Martin had championed this idea
>>> and had rejected all proposals to make existing classes inherit from
>>> object.
>>
>> IIRC this was because UserDict tries to be a MutableMapping but abcs
>> require new style classes.
>
> ISTM, this change should be reverted to the way it was in 2.6.
>
> The registration was already working fine:

Actually I believe it was an error that it could. There was a typo in
abc.py which prevented it from raising errors when non new-style class
objects were passed in.



-- 
Regards,
Benjamin

Re: [Python-Dev] red buildbots on 2.7

2010-06-22 Thread Bill Janssen

Alexander Belopolsky  wrote:

> On Mon, Jun 21, 2010 at 9:26 PM, Bill Janssen  wrote:
> ..
> > Though, isn't that behavior of urllib.proxy_bypass another bug?
> 
> I don't know.  Ask Ronald.

Hmmm.  I brought up the System Preferences panel on my Mac, and sure
enough, there's a checkbox, "Exclude simple hostnames".  So I guess it's
not a bug, though none of my Macs are configured that way.

Bill

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Toshio Kuratomi

On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
>  > unicode handling redesign.  I'm stating my reading of the RFC not to defend
>  > the use case Philip has, but because I think that the outlook that non-text
>  > uris (before being percentencoded) are violations of the RFC
> 
> That's not what I'm saying.  What I'm trying to point out is that
> manipulating a bytes object as an URI sort of presumes a lot about its
> encoding as text.

I think we're more or less in agreement now but here I'm not sure.  What
manipulations are you thinking about?  Which stage of URI construction are
you considering?

I've just taken a quick look at python3.1's urllib module and I see that
there is a bit of confusion there.  But it's not about unicode vs bytes but
about whether a URI should be operated on at the real URI level or the
data-that-makes-a-uri level.

* all functions I looked at take python3 str rather than bytes so there's no
  confusing stuff here
* urllib.request.urlopen takes a strict uri.  That means that you must have
  a percent encoded uri at this point
* urllib.parse.urljoin takes regular string values
* urllib.parse.urlparse and urllib.parse.urlunparse take regular string values
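
As a small illustration of the str-based parsing functions in
urllib.parse, the sketch below splits a percent-encoded URI and puts it
back together. The URI itself is made up for the example.

```python
from urllib.parse import unquote, urlparse, urlunparse

uri = 'http://host/caf%C3%A9?x=1'

parts = urlparse(uri)          # works at the percent-encoded (URI) level
rebuilt = urlunparse(parts)    # reassembles the same URI
display = unquote(parts.path)  # data level: decodes the %XX bytes as UTF-8
```

The parse/unparse pair never touches the percent-encoding; only unquote
moves from the URI level to the data level, and it has to assume an
encoding to do so.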

> Since many of the URIs we deal with are more or
> less textual, why not take advantage of that?
>
Cool, so to summarize what I think we agree on:

* Percent encoded URIs are text according to the RFC.
* The data that is used to construct the URI is not defined as text by the
  RFC.
* However, it is very often text in an unspecified encoding
* It is extremely convenient for programmers to be able to treat the data
  that is used to form a URI as text in nearly all common cases.
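
The first two points can be sketched as follows: the path segments start
out as bytes in two different encodings, so together they are not text
in any single encoding, yet the percent-encoded URI built from them is
plain ASCII text. The segment values are made up; quote_from_bytes and
unquote_to_bytes are the Python 3 urllib.parse functions.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Two path segments, deliberately in different encodings.
segments = ['日本語'.encode('shift-jis'), '日本語'.encode('utf-8')]

# Percent-encoding turns arbitrary bytes into an ASCII-only str.
path = '/' + '/'.join(quote_from_bytes(seg) for seg in segments)

# The original bytes are recoverable without knowing either encoding.
recovered = [unquote_to_bytes(part) for part in path.split('/')[1:]]
```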

-Toshio


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Guido van Rossum

On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger
 wrote:
>
> On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote:
>
>   This is a common pain-point for porting software to 3.x - you had a
> string, it kinda worked most of the time before, but now you need to keep
> track of text too and the functions which seemed to work on bytes no longer
> do.
>
> Thanks Glyph.  That is a nice summary of one kind of challenge facing
> programmers.

Ironically, Glyph also described the pain in 2.x: it only "kinda" worked.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Raymond Hettinger

On Jun 22, 2010, at 10:08 AM, Benjamin Peterson wrote:
> . There was a typo in
> abc.py which prevented it from raising errors when non new-style class
> objects were passed in.

For 2.x, that was probably a good thing, a happy accident
that made it possible to register existing mapping classes
as a MutableMapping.

"Fixing" that typo will break code that currently uses ABCs
with old-style classes.  

I believe we are better-off leaving this as it was released in 2.6.

Raymond

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Guido van Rossum

On Mon, Jun 21, 2010 at 10:28 PM, Stephen J. Turnbull
 wrote:
> Michael Urman writes:
>
>  > It is somewhat troublesome that there doesn't appear to be an obvious
>  > built-in idempotent-when-possible function that gives back the
>  > provided bytes/str,
>
> If you want something idempotent, it's already the case that
> bytes(b'abc') => b'abc'.  What might be desirable is to make
> bytes('abc') work and return b'abc', but only if 'abc' is pure ASCII
> (or maybe ISO 8859/1).

No, no, no! That's just what Python 2 did.

> Unfortunately, str(b'abc') already does work, but
>
> st...@uwakimon ~ $ python3.1
> Python 3.1.2 (release31-maint, May 12 2010, 20:15:06)
> [GCC 4.3.4] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> str(b'abc')
> "b'abc'"

>
> Oops.  You can see why that probably "should" be the case.

There is a near-contract that str() of pretty much anything returns a
"printable" version of that thing.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread James Y Knight



On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote:
> Similarly I'd expect (from experience) that a programmer using
> Python would want to take the same approach, sticking with unencoded
> data in nearly all situations.


Yeah. This is a real issue I have with the direction Python3 went: it  
pushes you into decoding everything to unicode early, even when you  
don't care -- all you really wanted to do is pass it from one API to  
another, with some well-defined transformations, which don't actually  
depend on it having been decoded properly. (For example, extracting  
the path from the URL and attempting to open it as a file on the  
filesystem.)


This means that Python3 programs can become *more* fragile in the face  
of random data you encounter out in the real world, rather than less  
fragile, which was the goal of the whole exercise.


The surrogateescape method is a nice workaround for this, but I can't  
help thinking that it might've been better to just treat stuff as  
possibly-invalid-but-probably-utf8 byte-strings from input, through  
processing, to output. It seems kinda too late for that, though: next  
time someone designs a language, they can try that. :)
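
A short sketch of the surrogateescape workaround mentioned above. The
path is made up; it contains a Latin-1 byte that is not valid UTF-8.

```python
raw = b'/srv/files/caf\xe9.txt'

# A strict decode fails on the stray byte.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate.
text = raw.decode('utf-8', errors='surrogateescape')

# Ordinary str operations work on the result.
directory, _, name = text.rpartition('/')

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode('utf-8', errors='surrogateescape')
```

The str in the middle is not "properly decoded" text, and encoding it
with a strict handler would fail, but the bytes survive the round trip.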


James

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread M.-A. Lemburg

Guido van Rossum wrote:
> [Just addressing one little issue here; generally I'm just happy that
> we're discussing this issue in such detail from so many points of
> view.]
> 
> On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi  wrote:
>> [...] Would urljoin(b_base, b_subdir) => bytes and
>> urljoin(u_base, u_subdir) => unicode be acceptable though?  (I think, given
>> other options, I'd rather see two separate functions, though.  It seems more
>> discoverable and less prone to taking bad input some of the time to have two
>> functions that clearly only take one type of data apiece.)
> 
> Hm. I'd rather see a single function (it would be "polymorphic" in my
> earlier terminology). After all a large number of string method calls
> (and some other utility function calls) already look the same
> regardless of whether they are handling bytes or text (as long as it's
> uniform). If the building blocks are all polymorphic it's easier to
> create additional polymorphic functions.
> 
> FWIW, there are two problems with polymorphic functions, though they
> can be overcome:
> 
> (1) Literals.
> 
> If you write something like x.split('&') you are implicitly assuming x
> is text. I don't see a very clean way to overcome this; you'll have to
> implement some kind of type check e.g.
> 
> x.split('&') if isinstance(x, str) else x.split(b'&')
> 
> A handy helper function can be written:
> 
>   def literal_as(constant, variable):
>       if isinstance(variable, str):
>           return constant
>       else:
>           return constant.encode('utf-8')
> 
> So now you can write x.split(literal_as('&', x)).

This polymorphism is what we used in Python2 a lot to write
code that works for both Unicode and 8-bit strings.

Unfortunately, this no longer works as easily in Python3 due
to the literals sometimes having the wrong type and using
such a helper function slows things down a lot.

It would be great if we could have something like the above as
builtin method:

x.split('&'.as(x))

Perhaps something to discuss at the language summit at EuroPython.

Too bad we can't add such porting enhancements to Python2 anymore.
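
Until something like that exists, one way to avoid encoding the literal
on every call is to build both spellings once. This is only a sketch
under that assumption: _AMP and split_params are hypothetical names, and
no timing claim is made here.

```python
# Both spellings of the literal are created once, at import time.
_AMP = {str: '&', bytes: b'&'}


def split_params(x):
    """Split on '&' using the literal that matches the type of x."""
    # Exact-type lookup: subclasses of str or bytes would raise KeyError.
    return x.split(_AMP[type(x)])
```

The lookup is keyed on the exact type, which is a deliberate
simplification compared with the isinstance() check in literal_as.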

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Toshio Kuratomi

On Tue, Jun 22, 2010 at 08:24:28AM -0500, Michael Urman wrote:
> On Tue, Jun 22, 2010 at 00:28, Stephen J. Turnbull  wrote:
> > Michael Urman writes:
> >
> >  > It is somewhat troublesome that there doesn't appear to be an obvious
> >  > built-in idempotent-when-possible function that gives back the
> >  > provided bytes/str,
> >
> > If you want something idempotent, it's already the case that
> > bytes(b'abc') => b'abc'.  What might be desirable is to make
> > bytes('abc') work and return b'abc', but only if 'abc' is pure ASCII
> > (or maybe ISO 8859/1).
> 
> By idempotent-when-possible, I mean to_bytes(str_or_bytes, encoding,
> errors) that would pass an instance of bytes through, or encode an
> instance of str. And of course a to_str that performs similarly,
> passing str through and decoding bytes. While bytes(b'abc') will give
> me b'abc', neither bytes('abc') nor bytes(b'abc', 'latin-1') get me
> the b'abc' I want to see.
> 
A month or so ago, I finally broke down and wrote a python2 library that had
these functions in it (along with a bunch of other trivial boilerplate
functions that I found myself writing over and over in different projects)

  
https://fedorahosted.org/releases/k/i/kitchen/docs/api-text-converters.html#unicode-and-byte-str-conversion

I suppose I could port this to python3 and we could see if it gains adoption
as a thirdparty addon.  I have been hesitating over doing that since I don't
use python3 for everyday work and I have a vague feeling that 2to3 won't
understand what that code needs to do.

-Toshio



Re: [Python-Dev] State of json in 2.7

2010-06-22 Thread Brett Cannon

[cc'ing Bob on his gmail address; didn't have any other address handy
so I don't know if this will actually get to him]

On Tue, Jun 22, 2010 at 09:54, Dirkjan Ochtman  wrote:
> It looks like simplejson 2.1.0 and 2.1.1 have been released:
>
> http://bob.pythonmac.org/archives/2010/03/10/simplejson-210/
> http://bob.pythonmac.org/archives/2010/03/31/simplejson-211/
>
> It looks like any changes that didn't come from the Python tree didn't
> go into the Python tree, either.

Has anyone asked Bob why he did this? There might be a logical reason.

-Brett

Re: [Python-Dev] State of json in 2.7

2010-06-22 Thread Bob Ippolito

On Tuesday, June 22, 2010, Brett Cannon  wrote:
> [cc'ing Bob on his gmail address; didn't have any other address handy
> so I don't know if this will actually get to him]
>
> On Tue, Jun 22, 2010 at 09:54, Dirkjan Ochtman  wrote:
>> It looks like simplejson 2.1.0 and 2.1.1 have been released:
>>
>> http://bob.pythonmac.org/archives/2010/03/10/simplejson-210/
>> http://bob.pythonmac.org/archives/2010/03/31/simplejson-211/
>>
>> It looks like any changes that didn't come from the Python tree didn't
>> go into the Python tree, either.
>
> Has anyone asked Bob why he did this? There might be a logical reason.

I've just been busy. It's not trivial to move patches from one to the
other, so it's not something that has been easy for me to get around
to actually doing. It seems that more often than not when I have had
time to look at something, it didn't line up well with python's
release schedule.

(and speaking of busy I'm en route for a week long honeymoon so don't
expect much else from me on this thread)

-bob

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Terry Reedy


On 6/22/2010 1:22 AM, Glyph Lefkowitz wrote:


The thing that I have heard in passing from a couple of folks with
experience in this area is that some older software in asia would
present characters differently if they were originally encoded in a
"japanese" encoding versus a "chinese" encoding, even though they were
really "the same" characters.


As I tried to say in another post, that to me is similar to wanting to 
present English text in different fonts depending on whether it is spoken 
by an American or a Brit, or a modern person versus a Renaissance person.



I do know that Han Unification is a giant political mess
( makes for some


Thanks, I will take a look.


interesting reading), but my understanding is that it has handled enough
of the cases by now that one can write software to display asian
languages and it will basically work with a modern version of unicode.
(And of course, there's always the private use area, as Stephen Turnbull
pointed out.)

Regardless, this is another example where keeping around a string isn't
really enough. If you need to display a japanese character in a distinct
way because you are operating in the japanese *script*, you need a tag
surrounding your data that is a hint to its presentation. The fact that
these presentation hints were sometimes determined by their encoding is
an unfortunate historical accident.


Yes. The asian languages I know anything about seem to natively have 
almost none of the symbols English has, many borrowed from math, that 
have been pressed into service for text markup.



--
Terry Jan Reedy


Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Terry Reedy


On 6/22/2010 9:24 AM, Michael Urman wrote:


By idempotent-when-possible, I mean to_bytes(str_or_bytes, encoding,
errors) that would pass an instance of bytes through, or encode an
instance of str. And of course a to_str that performs similarly,
passing str through and decoding bytes. While bytes(b'abc') will give
me b'abc', neither bytes('abc') nor bytes(b'abc', 'latin-1') get me
the b'abc' I want to see.

These are trivial functions;
I just don't fully understand why the capability isn't baked in.


Possible reasons: They are special purpose functions easily built on the 
basic functions provided. Fine for a 3rd party library. Most people do 
not need them. Some might be misled by them. As others have said, "Not 
every one-liner should be builtin".
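
For reference, the pass-through helpers Michael describes could be sketched roughly as below. This is only an illustration of the idea for Python 3: the names to_bytes and to_str come from his message, but the default encoding and the TypeError for other types are assumptions made here, not anything proposed in the thread.

```python
def to_bytes(value, encoding='utf-8', errors='strict'):
    """Pass bytes through unchanged; encode str using the given encoding."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding, errors)
    # Assumption: anything else is treated as a caller error.
    raise TypeError('expected str or bytes, got %s' % type(value).__name__)


def to_str(value, encoding='utf-8', errors='strict'):
    """Pass str through unchanged; decode bytes using the given encoding."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    raise TypeError('expected str or bytes, got %s' % type(value).__name__)


same = to_bytes(b'abc', 'latin-1')      # bytes in, bytes out
encoded = to_bytes('abc', 'latin-1')    # str in, bytes out
decoded = to_str(b'abc', 'latin-1')     # bytes in, str out
```

Calling either helper twice gives the same result as calling it once, which is the "idempotent-when-possible" property asked for.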


--
Terry Jan Reedy


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Terry Reedy


On 6/22/2010 12:53 PM, Guido van Rossum wrote:

On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger
  wrote:


On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote:

   This is a common pain-point for porting software to 3.x - you had a
string, it kinda worked most of the time before, but now you need to keep
track of text too and the functions which seemed to work on bytes no longer
do.

Thanks Glyph.  That is a nice summary of one kind of challenge facing
programmers.


Ironically, Glyph also described the pain in 2.x: it only "kinda" worked.


The people with problematic code to convert must include some who 
managed to tolerate and perhaps suppress the pain. I suspect that 
conversion attempts bring it back to the surface. It is natural to 
blame the re-surfacer rather than the original source. (As in 'blame the 
messenger').



--
Terry Jan Reedy


Re: [Python-Dev] [OT] glyphs [was Re: email package status in 3.X]

2010-06-22 Thread Terry Reedy


On 6/22/2010 6:52 AM, Steven D'Aprano wrote:

On Tue, 22 Jun 2010 11:46:27 am Terry Reedy wrote:

3. Unicode disclaims direct representation of glyphic variants
(though again, exceptions were made for asian acceptance). For
example, in English, mechanically printed 'a' and 'g' are different
from manually printed 'a' and 'g'. Representing both by the same
codepoint, in itself, loses information. One who wishes to preserve
the distinction must instead use a font tag or perhaps a
  tag. Similarly, older English had a significantly
different glyph for 's', which looks more like a modern 'f'.


An unfortunate example, as the old English long-s gets its own Unicode
codepoint.


Whoops. I suppose I should thank you for the correction so I never make 
the same error again. Thank you.



http://en.wikipedia.org/wiki/Long_s


Very interesting to find out the source of both the integral sign and 
shilling symbols.
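
The correction about the long s is easy to check from Python 3 with the standard unicodedata module. A small sketch (the variable names are only illustrative):

```python
import unicodedata

long_s = '\u017f'                      # the long s discussed above
name = unicodedata.name(long_s)        # official Unicode character name
is_distinct = long_s != 's'            # a separate codepoint from plain 's'
upper = long_s.upper()                 # its uppercase mapping
```

So the long s is a counter-example to the "glyph variants share a codepoint" rule, while still being tied to the ordinary letter through its case mapping.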


--
Terry Jan Reedy


[Python-Dev] Use of cgi.escape can lead to XSS vulnerabilities

2010-06-22 Thread Craig Younkins

Hello,

The method in question: http://docs.python.org/library/cgi.html#cgi.escape
http://svn.python.org/view/python/tags/r265/Lib/cgi.py?view=markup   # at
the bottom

"Convert the characters '&', '<' and '>' in string s to HTML-safe sequences.
Use this if you need to display text that might contain such characters in
HTML. If the optional flag quote is true, the quotation mark character ('"')
is also translated; this helps for inclusion in an HTML attribute value, as
in . If the value to be quoted might include single- or
double-quote characters, or both, consider using the quoteattr() function in
the xml.sax.saxutils module instead."

cgi.escape never escapes single quote characters, which can easily lead to a
Cross-Site Scripting (XSS) vulnerability. This seems to be known by many,
but a quick search reveals many are using cgi.escape for HTML attribute
escaping.

The intended use of this method is unclear to me. Up to and including the
latest published version of Mako (0.3.3), this method was the HTML escaping
method. Used in this manner, single-quoted attributes with user-supplied
data are easily susceptible to cross-site scripting vulnerabilities.
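
The problem can be illustrated without Mako. The sketch below is written for Python 3 and makes two assumptions: legacy_escape is a hypothetical stand-in that only mimics the documented behaviour of cgi.escape quoted above (it is not the stdlib source), and the single-quoted body attribute is an invented example. html.escape is the later stdlib function, which also escapes single quotes when quote is true.

```python
import html


def legacy_escape(s, quote=False):
    """Hypothetical stand-in: escapes & < > and, if quote is true, " only."""
    s = s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
    if quote:
        s = s.replace('"', '&quot;')
    return s


payload = "' onload='alert(1);' id='"

# Single quotes pass through untouched, so the payload closes the attribute
# and injects a new onload handler.
unsafe = "<body class='%s'>" % legacy_escape(payload, quote=True)

# html.escape(quote=True) rewrites ' as a character reference, so the
# payload stays inside the attribute value.
safe = "<body class='%s'>" % html.escape(payload, quote=True)
```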

Proof of concept in Mako:
>>> from mako.template import Template
>>> print Template("",
default_filters=['h']).render(data="' onload='alert(1);' id='")


I've emailed Michael Bayer, the creator of Mako, and this will be fixed in
version 0.3.4.

While the documentation says "if the value to be quoted might include
single- or double-quote characters... [use the] xml.sax.saxutils module
instead," it also implies that this method will make input safe for HTML.
Because this method escapes 4 of the 5 key XML characters, it is reasonable
to expect some will use it in the manner Mako did.

I suggest rewording the documentation for the method making it more clear
what it should and should not be used for. I would like to see the method
changed to properly escape single-quotes, but if it is not changed, the
documentation should explicitly say this method does not make input safe for
inclusion in HTML.

Shameless plug: http://www.PythonSecurity.org/

Craig Younkins

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking

On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight  wrote:

> The surrogateescape method is a nice workaround for this, but I can't help
> thinking that it might've been better to just treat stuff as
> possibly-invalid-but-probably-utf8 byte-strings from input, through
> processing, to output. It seems kinda too late for that, though: next time
> someone designs a language, they can try that. :)
>

surrogateescape does help a lot, my only problem with it is that it's
out-of-band information.  That is, if you have data that went through
data.decode('utf8', 'surrogateescape') you can restore it to bytes or
transcode it to another encoding, but you have to know that it was decoded
specifically that way.  And of course if you did have to transcode it (e.g.,
text.encode('utf8', 'surrogateescape').decode('latin1')) then if you had
actually handled the text in any way you may have broken it; you don't
*really* have valid text.  A lazier solution feels like it would be easier
and more transparent to work with.
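
A short Python 3 sketch of the round trip described above, and of why it is out-of-band information. The sample bytes are made up for illustration:

```python
raw = b'caf\xe9/\xff'          # latin-1 style bytes, not valid UTF-8

# Undecodable bytes become lone surrogates U+DC80..U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('utf-8', 'surrogateescape')

# The transcoding step from the message: back to bytes, then latin1.
transcoded = text.encode('utf-8', 'surrogateescape').decode('latin1')

# Without knowing how text was produced, a plain encode fails.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Nothing in the str object itself records that surrogateescape was used; the consumer has to know, which is the objection raised here.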

But... I also don't see any major language constraint to having another kind
of string that is bytes+encoding.  I think PJE brought up a problem with a
couple coercion aspects.

-- 
Ian Bicking  |  http://blog.ianbicking.org

Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Terry Reedy

Tres, I am a Python3 enthusiast and realist. I did not expect major 
adoption for about 3 years (more optimistic than the 5 years of some).


If you are feeling pressured to 'move' to Python3, it is not from me. I 
am sure you will do so on your own, perhaps even with enthusiasm, when 
it will be good for *you* to do so.


If someone wants to contribute while sticking to Python2, it's easy. The 
tracker has perhaps 2000 open 2.x issues, hundreds with no responses. If 
more Python2 people worked on making 2.7 as bug-free as possible, the 
developers would be freer to make 3.2 as good as possible (which is what 
*I* want).


The porting of numpy (which I suspect has gotten some urging) will not 
just benefit 'numerical' computing. For instance, there cannot be a 3.x 
version of pygame until there is a 3.x version of numpy, its main Python 
dependency. (The C Simple Directmedia Library it also wraps and builds 
upon does not care.)


--
Terry Jan Reedy


Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Guido van Rossum

On Tue, Jun 22, 2010 at 9:37 AM, Tres Seaver  wrote:
> Any "turdiness" (which I am *not* arguing for) is a natural consequence
> of the kinds of backward incompatibilities which were *not* ruled out
> for Python 3, along with the (early, now waning) "build it and they will
>  come" optimism about adoption rates.

FWIW, my optimism is *not* waning. I think it's good that we're
having this discussion and I expect something useful will come out of
it; I also expect in general that the (admittedly serious) problem of
having to port all dependencies will be solved in the next few years.
Not by magic, but because many people are taking small steps in the
right direction, and there will be light eventually. In the mean time
I don't blame anyone for sticking with 2.x or being too busy to help
port stuff to 3.x. Python 3 has been a long time in the making -- it
will be a bit longer still, which was expected.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Use of cgi.escape can lead to XSS vulnerabilities

2010-06-22 Thread Bill Janssen

Craig Younkins  wrote:

> cgi.escape never escapes single quote characters, which can easily lead to a
> Cross-Site Scripting (XSS) vulnerability. This seems to be known by many,
> but a quick search reveals many are using cgi.escape for HTML attribute
> escaping.

Did you file a bug report?

Bill

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Robert Collins

On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg  wrote:

>>           return constant.encode('utf-8')
>>
>> So now you can write x.split(literal_as('&', x)).
>
> This polymorphism is what we used in Python2 a lot to write
> code that works for both Unicode and 8-bit strings.
>
> Unfortunately, this no longer works as easily in Python3 due
> to the literals sometimes having the wrong type and using
> such a helper function slows things down a lot.

It didn't work in 2 either - see for instance the traceback module with
an Exception with unicode args and a non-ascii file path - the file
path is in its bytes form, the string joining logic triggers an
implicit upcast and *boom*.

> Too bad we can't add such porting enhancements to Python2 anymore

Perhaps a 'py3compat' module on pypi, with things like the py._builtin
reraise helper and so forth ?

-Rob

Re: [Python-Dev] red buildbots on 2.7

2010-06-22 Thread Martin v. Löwis


This effectively substitutes getgrouplist called on the current user
for getgroups.  In 3.x, I believe the correct action will be to
provide direct access to getgrouplist, which, while not POSIX (yet?),
is widely available.


As a policy, adding non-POSIX functions to the posix module is perfectly 
fine, as long as there is an autoconf test for it (plain ifdefs are 
grudgingly accepted also).

Regards,
Martin

Re: [Python-Dev] State of json in 2.7

2010-06-22 Thread Fred Drake

On Tue, Jun 22, 2010 at 12:56 PM, Benjamin Peterson  wrote:
> Never have externally maintained packages.

Seriously!  I concur with this.

Fortunately, it's not a real problem in this case.

There's the (maintained) simplejson package, and the unmaintained json
package.  And simplejson works with older versions of Python, too,
:-)


  -Fred

-- 
Fred L. Drake, Jr.
"A storm broke loose in my mind."  --Albert Einstein

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Nick Coghlan

On Wed, Jun 23, 2010 at 2:17 AM, Guido van Rossum  wrote:
> (1) Literals.
>
> If you write something like x.split('&') you are implicitly assuming x
> is text. I don't see a very clean way to overcome this; you'll have to
> implement some kind of type check e.g.
>
>    x.split('&') if isinstance(x, str) else x.split(b'&')
>
> A handy helper function can be written:
>
>  def literal_as(constant, variable):
>      if isinstance(variable, str):
>          return constant
>      else:
>          return constant.encode('utf-8')
>
> So now you can write x.split(literal_as('&', x)).

I think this is a key point. In checking the behaviour of the os
module bytes APIs (see below), I used a simple filter along the lines
of:

  [x for x in seq if x.endswith("b")]

It would be nice if code along those lines could easily be made polymorphic.
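
With the literal_as helper from Guido's message, that filter can already be written polymorphically today, at the cost of a helper call per element. A minimal Python 3 sketch (names_ending_in_b is a name invented here for illustration):

```python
def literal_as(constant, variable):
    """Return the text constant in the same type (str or bytes) as variable."""
    if isinstance(variable, str):
        return constant
    return constant.encode('utf-8')


def names_ending_in_b(seq):
    """Keep the items ending in 'b', whether seq holds str or bytes."""
    return [x for x in seq if x.endswith(literal_as("b", x))]


text_hits = names_ending_in_b(['environb', 'getcwd', 'getcwdb'])
byte_hits = names_ending_in_b([b'environb', b'getcwd', b'getcwdb'])
```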

Maybe what we want is a new class method on bytes and str (this idea
is similar to what MAL suggests later in the thread):

  def coerce(cls, obj, encoding=None, errors='surrogateescape'):
      if isinstance(obj, cls):
          return obj
      if encoding is None:
          encoding = sys.getdefaultencoding()
      # This is the str version; bytes.coerce would use obj.encode() instead
      return obj.decode(encoding, errors)

Then my example above could be made polymorphic (for ASCII compatible
encodings) by writing:

  [x for x in seq if x.endswith(x.coerce("b"))]

I'm trying to see downsides to this idea, and I'm not really seeing
any (well, other than 2.7 being almost out the door and the fact we'd
have to grant ourselves an exception to the language moratorium)

> (2) Data sources.
>
> These can be functions that produce new data from non-string data,
> e.g. str(), read it from a named file, etc. An example is read()
> vs. write(): it's easy to create a (hypothetical) polymorphic stream
> object that accepts both f.write('booh') and f.write(b'booh'); but you
> need some other hack to make read() return something that matches a
> desired return type. I don't have a generic suggestion for a solution;
> for streams in particular, the existing distinction between binary and
> text streams works, of course, but there are other situations where
> this doesn't generalize (I think some XML interfaces have this
> awkwardness in their API for converting a tree to a string).

We may need to use the os and io modules as the precedents here:

os: normal API is text using the surrogateescape error handler,
parallel bytes API exposes raw bytes. Parallel API is polymorphic if
possible (e.g. os.listdir), but appends a 'b' to the name if the
polymorphic approach isn't practical (e.g. os.environb, os.getcwdb,
os.getenvb).
io: layered API, where both the raw bytes of the wire protocol and the
decoded text of the text layer are available
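
Both precedents can be seen in a few lines of Python 3. This is only a sketch of existing stdlib behaviour; the sample text is made up, and os.environb is left out because it is not available on every platform.

```python
import io
import os

# os: polymorphic where practical (the argument type picks the result type)...
text_names = os.listdir('.')       # list of str
byte_names = os.listdir(b'.')      # list of bytes

# ...and a parallel b-suffixed name where it is not.
cwd_text = os.getcwd()             # str
cwd_bytes = os.getcwdb()           # bytes

# io: a text layer stacked on a bytes layer, with both reachable.
raw = io.BytesIO('na\u00efve\n'.encode('utf-8'))
text = io.TextIOWrapper(raw, encoding='utf-8')
line = text.readline()             # decoded str from the text layer
underlying = text.buffer           # the bytes layer underneath
```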

Regards,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Nick Coghlan

On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg  wrote:
> It would be great if we could have something like the above as
> builtin method:
>
> x.split('&'.as(x))

As per my other message, another possible (and reasonably intuitive)
spelling would be:

  x.split(x.coerce('&'))

Writing it as a helper function is also possible, although it would be
trickier to remember the correct argument ordering:

  import sys

  def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
      if isinstance(obj, type(target)):
          return obj
      if encoding is None:
          encoding = sys.getdefaultencoding()
      try:
          convert = obj.decode   # bytes -> str
      except AttributeError:
          convert = obj.encode   # str -> bytes
      return convert(encoding, errors)

  x.split(coerce_to(x, '&'))

> Perhaps something to discuss on the language summit at EuroPython.
>
> Too bad we can't add such porting enhancements to Python2 anymore.

Well, we can if we really want to, it just entails convincing Benjamin
to reschedule the 2.7 final release. Given the UserDict/ABC/old-style
classes issue, there's a fair chance there's going to be at least one
more 2.7 RC anyway.

That said, since this kind of coercion can be done in a helper
function, that should be adequate for the 2.x to 3.x conversion case
(for 2.x, the helper function can be defined to just return the second
argument since bytes and str are the same type, while the 3.x version
would look something like the code above)
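
A rough sketch of how such a dual definition might look in a single source file. It assumes the helper is selected by sys.version_info at import time, and that the 2.x branch simply ignores its extra arguments; neither detail is spelled out above.

```python
import sys

if sys.version_info[0] < 3:
    def coerce_to(target, obj, encoding=None, errors='strict'):
        # 2.x: bytes and str are the same type, so return obj unchanged.
        return obj
else:
    def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
        # 3.x: convert obj to the type of target when the types differ.
        if isinstance(obj, type(target)):
            return obj
        if encoding is None:
            encoding = sys.getdefaultencoding()
        try:
            convert = obj.decode   # bytes -> str
        except AttributeError:
            convert = obj.encode   # str -> bytes
        return convert(encoding, errors)


query_text = 'a=1&b=2'
query_bytes = b'a=1&b=2'
parts_text = query_text.split(coerce_to(query_text, '&'))
parts_bytes = query_bytes.split(coerce_to(query_bytes, '&'))
```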

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Greg Ewing


Benjamin Peterson wrote:


IIRC this was because UserDict tries to be a MutableMapping but abcs
require new style classes.


Are there any use cases for UserList and UserDict in new
code, now that list and dict can be subclassed?

If not, I don't think it would be a big problem if they
were left out of the ABC ecosystem. No worse than what
happens to any other existing user-defined class that
predates ABCs -- if people want them to inherit from
ABCs, they have to update their code. In this case, the
update would consist of changing subclasses to inherit
from list or dict instead.

--
Greg

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Michael Foord


On 23/06/2010 00:03, Greg Ewing wrote:

Benjamin Peterson wrote:


IIRC this was because UserDict tries to be a MutableMapping but abcs
require new style classes.


Are there any use cases for UserList and UserDict in new
code, now that list and dict can be subclassed?


Inheriting from list or dict isn't very useful as you have to 
override *every* method to control behaviour.


(For example with the dict if you override __setitem__ then update and 
setdefault (etc) don't go through your new __setitem__ and if you 
override __getitem__ then pop and friends don't go through your new 
__getitem__.)
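
The difference is easy to demonstrate. The sketch below uses Python 3, where UserDict lives in the collections module rather than in its own UserDict module as in 2.x; the two class names are invented for the example.

```python
from collections import UserDict


class UpperKeyDict(dict):
    """dict subclass: only direct item assignment sees the override."""
    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)


class UpperKeyUserDict(UserDict):
    """UserDict subclass: update() and setdefault() route through it too."""
    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)


plain = UpperKeyDict()
plain['a'] = 1            # goes through the override
plain.update(b=2)         # bypasses it
plain.setdefault('c', 3)  # bypasses it

wrapped = UpperKeyUserDict()
wrapped['a'] = 1
wrapped.update(b=2)
wrapped.setdefault('c', 3)

plain_keys = sorted(plain)
wrapped_keys = sorted(wrapped)
```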


In 2.6+ you can of course use the collections.MutableMapping abc, but if 
you want to write cross-Python version code UserDict is still useful. If 
you want abc support then you are *already* on 2.6+ though I guess.


All the best,

Michael



If not, I don't think it would be a big problem if they
were left out of the ABC ecosystem. No worse than what
happens to any other existing user-defined class that
predates ABCs -- if people want them to inherit from
ABCs, they have to update their code. In this case, the
update would consist of changing subclasses to inherit
from list or dict instead.




--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf of 
your employer, to release me from all obligations and waivers arising from any 
and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, 
clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and 
acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your 
employer, its partners, licensors, agents and assigns, in perpetuity, without 
prejudice to my ongoing rights and privileges. You further represent that you 
have the authority to release me from any BOGUS AGREEMENTS on behalf of your 
employer.



Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Michael Foord


On 22/06/2010 22:40, Robert Collins wrote:

On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg  wrote:

   

   return constant.encode('utf-8')

So now you can write x.split(literal_as('&', x)).
   

This polymorphism is what we used in Python2 a lot to write
code that works for both Unicode and 8-bit strings.

Unfortunately, this no longer works as easily in Python3 due
to the literals sometimes having the wrong type and using
such a helper function slows things down a lot.
 

It didn't work in 2 either - see for instance the traceback module with
an Exception with unicode args and a non-ascii file path - the file
path is in its bytes form, the string joining logic triggers an
implicit upcast and *boom*.

   
Yeah, there are still a few places in unittest where a unicode exception 
can cause the whole test run to bomb out. No-one has *yet* reported 
these as bugs and I try and ferret them out as I find them.


All the best,

Michael


Too bad we can't add such porting enhancements to Python2 anymore
 

Perhaps a 'py3compat' module on pypi, with things like the py._builtin
reraise helper and so forth ?

-Rob
   



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog


Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Raymond Hettinger


On Jun 22, 2010, at 3:59 PM, Michael Foord wrote:

> On 23/06/2010 00:03, Greg Ewing wrote:
>> Benjamin Peterson wrote:
>> 
>>> IIRC this was because UserDict tries to be a MutableMapping but abcs
>>> require new style classes.
>> 
>> Are there any use cases for UserList and UserDict in new
>> code, now that list and dict can be subclassed?
> 
> Inheriting from list or dict isn't very useful as you have to override 
> *every* method to control behaviour.


Benjamin fixed the UserDict  and ABC problem earlier today in r82155.
It is now the same as it was in Py2.6.
Nothing to see here.
Move along.


Raymond

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Michael Foord


On 22/06/2010 19:07, James Y Knight wrote:


On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote:
Similarly I'd expect (from experience) a programmer using Python 
to want to take the same approach, sticking with unencoded data in 
nearly all situations.


Yeah. This is a real issue I have with the direction Python3 went: it 
pushes you into decoding everything to unicode early,


Well, both .NET and Java take this approach as well. I wonder how they 
cope with the particular issues that have been mentioned for web 
applications - both platforms are used extensively for web apps.


Having used IronPython, which has .NET unicode strings (although it does 
a lot of magic to *allow* you to store binary data in strings for 
compatibility with CPython),  I have to say that this approach makes a 
lot of programming *so* much more pleasant.


We did a lot of I/O (can you do useful programming without I/O?) 
including working with databases, but I didn't work *much* with wire 
protocols (fetching a fair bit of data from the web though, now I think 
about it). I think wire protocols can present particular problems; 
sometimes, it seems, they have mixed encodings in the same data. Where 
you don't have these problems, keeping bytes data and Unicode text data 
separate and encoding / decoding at the boundaries is really much more 
sane and pleasant.


It would be a real shame if we decided that the way forward for Python 3 
was to try and move closer to how bytes/text was handled in Python 2.


All the best,

Michael

even when you don't care -- all you really wanted to do is pass it 
from one API to another, with some well-defined transformations, which 
don't actually depend on it having been decoded properly. (For 
example, extracting the path from the URL and attempting to open it as 
a file on the filesystem.)
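
As a point of comparison, in later Python 3 releases the URL example can be kept entirely in bytes, because urllib.parse accepts bytes input and returns bytes results. A minimal sketch, assuming a POSIX-style document root; url_to_fs_path and the /srv/www root are hypothetical names, and the function does nothing to guard against '..' segments.

```python
import posixpath
from urllib.parse import unquote_to_bytes, urlsplit


def url_to_fs_path(raw_url, root=b'/srv/www'):
    """Map a bytes URL to a bytes filesystem path without decoding to text."""
    quoted_path = urlsplit(raw_url).path         # bytes in, bytes out
    path = unquote_to_bytes(quoted_path)         # %XX escapes -> raw bytes
    return posixpath.join(root, path.lstrip(b'/'))


# %E9 is not valid UTF-8 on its own, yet no decoding step is needed.
fs_path = url_to_fs_path(b'http://example.com/files/caf%E9.txt')
```

The resulting bytes path can be handed straight to open() on a POSIX system.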


This means that Python3 programs can become *more* fragile in the face 
of random data you encounter out in the real world, rather than less 
fragile, which was the goal of the whole exercise.


The surrogateescape method is a nice workaround for this, but I can't 
help thinking that it might've been better to just treat stuff as 
possibly-invalid-but-probably-utf8 byte-strings from input, through 
processing, to output. It seems kinda too late for that, though: next 
time someone designs a language, they can try that. :)


James


   



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Ian Bicking

On Tue, Jun 22, 2010 at 11:17 AM, Guido van Rossum  wrote:

> (2) Data sources.
>
> These can be functions that produce new data from non-string data,
> e.g. str(), read it from a named file, etc. An example is read()
> vs. write(): it's easy to create a (hypothetical) polymorphic stream
> object that accepts both f.write('booh') and f.write(b'booh'); but you
> need some other hack to make read() return something that matches a
> desired return type. I don't have a generic suggestion for a solution;
> for streams in particular, the existing distinction between binary and
> text streams works, of course, but there are other situations where
> this doesn't generalize (I think some XML interfaces have this
> awkwardness in their API for converting a tree to a string).
>

This reminds me of the optimization ElementTree and lxml made in Python 2
(not sure what they do in Python 3?) where they use str when a string is
ASCII to avoid the memory and performance overhead of unicode.  lxml, at
least, is also dealing with the divide between the internal libxml2
string representation and the Python representation.  This is a place where
bytes+encoding might also have some benefit.  XML is someplace where you
might load a bunch of data but only touch a little bit of it, and the amount
of data is frequently large enough that the efficiencies are important.

-- 
Ian Bicking  |  http://blog.ianbicking.org

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread P.J. Eby


At 07:41 AM 6/23/2010 +1000, Nick Coghlan wrote:

> Then my example above could be made polymorphic (for ASCII compatible
> encodings) by writing:
>
>   [x for x in seq if x.endswith(x.coerce("b"))]
>
> I'm trying to see downsides to this idea, and I'm not really seeing
> any (well, other than 2.7 being almost out the door and the fact we'd
> have to grant ourselves an exception to the language moratorium)


Notice, however, that if multi-string operations used a coercion 
protocol (they currently have to do type checks already for 
byte/unicode mixes), then you could make the entire stdlib 
polymorphic by default, even for other kinds of strings that don't exist yet.


If you invent a new numeric type, generally speaking you can pass it 
to existing stdlib functions taking numbers, as long as it implements 
the appropriate protocols.  Why not do the same for strings?
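
A rough sketch of what such a coercion step could look like today, written as a free function because neither str nor bytes has a coerce() method. The name coerce_like is an assumption made for illustration, and only str and bytes are handled:

```python
def coerce_like(template, literal):
    """Return the ASCII str `literal` as the same string type as `template`.

    Hypothetical helper; a real protocol would dispatch to the type itself.
    """
    if isinstance(template, bytes):
        return literal.encode("ascii")
    if isinstance(template, str):
        return literal
    raise TypeError("unsupported string type: %r" % type(template))


def ending_with_b(seq):
    # Works for a sequence of str or a sequence of bytes.
    return [x for x in seq if x.endswith(coerce_like(x, "b"))]


text_result = ending_with_b(["ab", "cd", "b"])
bytes_result = ending_with_b([b"ab", b"cd", b"b"])
```

A protocol-based version would let a new string type supply its own conversion instead of being listed in the isinstance checks.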



Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz


On Jun 22, 2010, at 12:53 PM, Guido van Rossum wrote:

> On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger
>  wrote:
>> 
>> On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote:
>> 
>> This is a common pain-point for porting software to 3.x - you had a
>> string, it kinda worked most of the time before, but now you need to keep
>> track of text too and the functions which seemed to work on bytes no longer
>> do.
>> 
>> Thanks Glyph.  That is a nice summary of one kind of challenge facing
>> programmers.
> 
> Ironically, Glyph also described the pain in 2.x: it only "kinda" worked.

It was not my intention to be ironic about it - that was exactly what I meant 
:).  3.x is forcing you to confront an issue that you _should_ have confronted 
for 2.x anyway. 

(And, I hope, most libraries doing a 3.x migration will take the opportunity to 
make their 2.x APIs unicode-clean while still in 2to3 mode, and jump ship to 
3.x source only _after_ there's a nice transition path for their clients that 
can be taken in 2 steps.)


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz

On Jun 22, 2010, at 2:07 PM, James Y Knight wrote:

> Yeah. This is a real issue I have with the direction Python3 went: it pushes 
> you into decoding everything to unicode early, even when you don't care -- 
> all you really wanted to do is pass it from one API to another, with some 
> well-defined transformations, which don't actually depend on it having been 
> decoded properly. (For example, extracting the path from the URL and 
> attempting to open it as a file on the filesystem.)

But you _do_ need to decode it in this case.  If you got your URL from some 
funky UTF-32 datasource, b"\x00\x00\x00/" is not a path separator, "/" is.  
Plus, you should really be separating path segments and looking at them 
individually so that you don't fall victim to "%2F" bugs.  And if you want your 
code to be portable, you need a Unicode representation of your pathname anyway 
for Windows; plus, there, you need to care about "\" as well as "/".
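
Both points can be checked with a short sketch. The path below is a made-up example; unquote() is the standard urllib.parse function:

```python
from urllib.parse import unquote

# In UTF-32 (big-endian), the "/" character is four bytes, not one.
utf32_slash = "/".encode("utf-32-be")

# Hypothetical URL path whose second segment contains an escaped slash.
raw_path = "/files/a%2Fb/c"

# Split into segments first, then unescape each one.
segments = [unquote(part) for part in raw_path.split("/")]

# Unescaping first makes the escaped slash look like a separator.
naive = unquote(raw_path).split("/")
```

Here segments keeps "a/b" as a single path segment, while naive splits it into "a" and "b", which is the "%2F" bug.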

The fact that your wire-bytes were probably ASCII(-ish) and your filesystem 
probably encodes pathnames as UTF-8 and so everything looks like it lines up is 
no excuse not to be explicit about your expectations there.

You may want to transcode your characters into some other characters later, but 
that shouldn't stop you from treating them as characters of some variety in the 
meanwhile.

> The surrogateescape method is a nice workaround for this, but I can't help 
> thinking that it might've been better to just treat stuff as 
> possibly-invalid-but-probably-utf8 byte-strings from input, through 
> processing, to output. It seems kinda too late for that, though: next time 
> someone designs a language, they can try that. :)
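
For reference, a small sketch of what the surrogateescape error handler does with bytes that are not valid UTF-8. The filename is a made-up example:

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; \xe9 is not valid UTF-8 here

# The undecodable byte becomes a lone surrogate instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```

The decoded value is a str, so string operations work on it, but it still carries the original byte through to output.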

I can think of lots of optimizations that might be interesting for Python (or 
perhaps some other runtime less concerned with cleverness overload, like PyPy) 
to implement, like a UTF-8 combining-characters overlay that would allow for 
fast indexing, lazily populated as random access dictates.  But this could all 
be implemented as smartness inside .encode() and .decode() and the str and 
bytes types without changing the way the API works.

I realize that there are implications at the C level, but as long as you can 
squeeze a function call in to certain places, it could still work.

I can also appreciate what's been said in this thread a bunch of times: to my 
knowledge, nobody has actually shown a profile of an application where encoding 
is significant overhead.  I believe that encoding _will_ be a significant 
overhead for some applications (and actually I think it will be very 
significant for some applications that I work on), but optimizations should 
really be implemented once that's been demonstrated, so that there's a better 
understanding of what the overhead is, exactly.  Is memory a big deal?  Is CPU? 
 Is it both?  Do you want to tune for the tradeoff?  etc, etc.  Clever 
data-structures seem premature until someone has a good idea of all those 
things.
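
One way to start answering those questions is to measure. The following is only a micro-benchmark sketch with a made-up payload, not a profile of a real application:

```python
import sys
import timeit

payload = ("x" * 1000).encode("utf-8")  # hypothetical 1 kB ASCII payload

# CPU: total seconds for 10,000 decodes of the payload.
decode_seconds = timeit.timeit(lambda: payload.decode("utf-8"), number=10_000)

# Memory: size of the bytes object versus the decoded str object.
bytes_size = sys.getsizeof(payload)
str_size = sys.getsizeof(payload.decode("utf-8"))
```

The numbers depend on the interpreter version and build, so they say nothing general; a profile of the actual workload is still what the argument calls for.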


Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Glyph Lefkowitz

On Jun 22, 2010, at 7:23 PM, Ian Bicking wrote:

> This is a place where bytes+encoding might also have some benefit.  XML is 
> someplace where you might load a bunch of data but only touch a little bit of 
> it, and the amount of data is frequently large enough that the efficiencies 
> are important.

Different encodings have different characteristics, though, which makes them 
amenable to different types of optimizations.  If you've got an ASCII string or 
a latin1 string, the optimizations of unicode are pretty obvious; if you've got 
one in UTF-16 with no multi-code-unit sequences, you could also hypothetically 
cheat for a while if you're on a UCS4 build of Python.

I suspect the practical problem here is that there's no CharacterString ABC in 
the collections module for third-party libraries to provide their own 
peculiarly-optimized implementations that could lazily turn into real 'str's as 
needed.  I'd volunteer to write a PEP if I thought I could actually get it done 
:-\.  If someone else wants to be the primary author though, I'll try to help 
out.
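
To make the idea concrete, here is a very small sketch of a bytes+encoding holder that turns into a real str only when asked. The class name LazyText and its decode_count attribute are invented for illustration; this is not a proposed CharacterString ABC:

```python
class LazyText:
    """Hypothetical bytes+encoding holder that decodes only on demand."""

    def __init__(self, data, encoding):
        self._data = data
        self._encoding = encoding
        self._text = None      # filled in on first use
        self.decode_count = 0  # for illustration only

    def __str__(self):
        if self._text is None:
            self._text = self._data.decode(self._encoding)
            self.decode_count += 1
        return self._text

    def __bytes__(self):
        # Passing the value through untouched never triggers a decode.
        return self._data


node = LazyText("日本語".encode("utf-8"), "utf-8")
passed_through = bytes(node)
untouched_decodes = node.decode_count
as_str = str(node)
again = str(node)
```

An ABC would matter once such an object has to be accepted by functions that currently insist on a real str.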


Re: [Python-Dev] email package status in 3.X

2010-06-22 Thread Michael Urman

On Tue, Jun 22, 2010 at 15:32, Terry Reedy  wrote:
> On 6/22/2010 9:24 AM, Michael Urman wrote:
>> These are trivial functions;
>> I just don't fully understand why the capability isn't baked in.
>
> Possible reasons: They are special purpose functions easily built on the
> basic functions provided. Fine for a 3rd party library. Most people do not
> need them. Some might be misled by them. As others have said, "Not every
> one-liner should be builtin".

Perhaps the two-argument constructions on bytes and str should have
been removed in favor of the .decode and .encode methods on their
respective classes. Or vice versa; I don't have the history to know in
which order they originated, and which is theoretically preferred
these days.
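
For comparison, the two spellings in Python 3 produce the same results:

```python
data = b"caf\xc3\xa9"
text = "café"

# bytes -> str: two-argument constructor versus method
via_constructor = str(data, "utf-8")
via_method = data.decode("utf-8")

# str -> bytes: two-argument constructor versus method
bytes_via_constructor = bytes(text, "utf-8")
bytes_via_method = text.encode("utf-8")
```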

-- 
Michael Urman
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Mike Klaas

On Tue, Jun 22, 2010 at 4:23 PM, Ian Bicking  wrote:

> This reminds me of the optimization ElementTree and lxml made in Python 2
> (not sure what they do in Python 3?) where they use str when a string is
> ASCII to avoid the memory and performance overhead of unicode.

An optimization that forces me to typecheck the return value of the
function and that I only discovered after code started breaking.  I
can't say I was enthused about that decision when I discovered it.
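
The kind of defensive wrapper this forces on callers looks roughly like the sketch below. element_text() is a stand-in that imitates the described behaviour in Python 3 terms; it is not the real ElementTree or lxml API, and as_text() is a hypothetical helper:

```python
def as_text(value, encoding="ascii"):
    """Normalize a value that may be bytes or str into str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def element_text(raw):
    # Stand-in for an API that returns bytes for ASCII-only content
    # and str otherwise; not the real ElementTree or lxml behaviour.
    try:
        return raw.encode("ascii")
    except UnicodeEncodeError:
        return raw


plain = element_text("title")
accented = element_text("título")
normalized = [as_text(plain), as_text(accented)]
```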

-Mike

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Robert Collins

On Wed, Jun 23, 2010 at 12:25 PM, Glyph Lefkowitz
 wrote:
> I can also appreciate what's been said in this thread a bunch of times: to my 
> knowledge, nobody has actually shown a profile of an application where 
> encoding is significant overhead.  I believe that encoding _will_ be a 
> significant overhead for some applications (and actually I think it will be 
> very significant for some applications that I work on), but optimizations 
> should really be implemented once that's been demonstrated, so that there's a 
> better understanding of what the overhead is, exactly.  Is memory a big deal? 
>  Is CPU?  Is it both?  Do you want to tune for the tradeoff?  etc, etc.  
> Clever data-structures seem premature until someone has a good idea of all 
> those things.

bzr has a cache of decoded strings in it precisely because decode is
slow. We accept slowness encoding to the user's locale because that's
typically much less data to examine than we've examined while
generating the commit/diff/whatever. We also face memory pressure on a
regular basis, and that has been, at least partly, due to UCS4 - our
translation cache helps there because we have fewer duplicate UCS4
strings.
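
A decode cache of this general kind can be sketched with functools.lru_cache. This is an illustration only, not bzr's actual implementation; the function name and cache size are assumptions:

```python
from functools import lru_cache


@lru_cache(maxsize=4096)
def cached_decode(raw, encoding="utf-8"):
    """Decode bytes once and reuse the resulting str object afterwards."""
    return raw.decode(encoding)


first = cached_decode(b"src/module.py")
second = cached_decode(b"src/module.py")
info = cached_decode.cache_info()
```

The second call returns the very same str object, which saves the decode and also avoids holding a duplicate copy of the decoded text.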

You're welcome to dig deeper into this, but I don't have more detail
paged into my head at the moment.

-Rob

Re: [Python-Dev] red buildbots on 2.7

2010-06-22 Thread Bill Janssen

Bill Janssen  wrote:

> Considering that we've just released 2.7rc2, there are an awful lot of
> red buildbots for 2.7.  In fact, I don't remember having seen a green
> buildbot for OS X and 2.7.  Shouldn't these be fixed?

Thanks to some action by Ronald, my two PPC OS X buildbots are now
showing green for the trunk.

Bill

Re: [Python-Dev] UserDict in 2.7

2010-06-22 Thread Fred Drake

On Tue, Jun 22, 2010 at 7:17 PM, Raymond Hettinger
 wrote:
> Benjamin fixed the UserDict  and ABC problem earlier today in r82155.
> It is now the same as it was in Py2.6.

Thanks, Benjamin!


  -Fred

-- 
Fred L. Drake, Jr.
"A storm broke loose in my mind."  --Albert Einstein

Re: [Python-Dev] bytes / unicode

2010-06-22 Thread Stephen J. Turnbull

Ian Bicking writes:

 > Just for perspective, I don't know if I've ever wanted to deal with a URL
 > like that.

Ditto, I do many times a day for Japanese media sites and Wikipedia.

 > I know how it is supposed to work, and I know what a browser does
 > with that, but so many tools will clean that URL up *or* won't be
 > able to deal with it at all that it's not something I'll be passing
 > around.

I'm not suggesting that is something you want to be "passing around";
it's a presentation form, and I prefer that the internal form use
Unicode.

 > While it's nice to be correct about encodings, sometimes it is
 > impractical.  And it is far nicer to avoid the situation entirely.

But you cannot avoid it entirely.  Processing bytes means you are
assuming ASCII compatibility.  Granted, this is a pretty good
assumption, especially if you got the bytes off the wire, but it's not
universally so.

Maybe it's a YAGNI, but one reason I prefer the decode-process-encode
paradigm is that choice of codec is a specification of the assumptions
you're making about encoding.  So the Know-Nothing codec described
above assumes just enough ASCII compatibility to parse the scheme.
You could also have codecs which assume just enough ASCII
compatibility to parse a hierarchical scheme, etc.
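
A sketch of decode-process-encode with a deliberately minimal assumption. It uses the standard latin-1 codec as a stand-in, as suggested earlier in the thread; the URI is a made-up example, and this is not the Know-Nothing codec itself:

```python
# Hypothetical URI whose path bytes are in some unknown, non-UTF-8 encoding.
raw = b"http://example.jp/\x93\xfa\x96\x7b/index.html"

# latin-1 never fails: every byte maps to exactly one code point.
text = raw.decode("latin-1")

# str operations on the ASCII-compatible parts work as usual.
scheme, rest = text.split(":", 1)

# Encoding with the same codec returns the original bytes unchanged.
round_tripped = text.encode("latin-1")
```

The choice of codec documents the assumption: only enough ASCII compatibility to find the scheme is relied on, and nothing is claimed about what the path bytes mean.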

 > That is, decoding content you don't care about isn't just
 > inefficient, it's complicated and can introduce errors.

That depends on the codec(s) used.

 > Similarly I'd expect (from experience) that a programmer using
 > Python to want to take the same approach, sticking with unencoded
 > data in nearly all situations.

Indeed, a programmer using Python 2 would want to do so, because all
her literal strings are bytes by default (ie, if she doesn't mark them
with `u'), and interactive input is, too.  This is no longer so
obvious in Python 3 which takes the attitude that things that are
expected to be human-readable should be processed as str.  The obvious
example in URI space is the file:/// URL, which you'll typically build
up from a user string or a file browser, which will call the os.path
stuff which returns str.
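
A sketch of that file: case, with a made-up directory and filename. Everything stays str until the final step, and posixpath is used so the result does not depend on the platform running the example:

```python
import posixpath
from urllib.parse import quote

directory = "/home/user/My Documents"  # hypothetical user-supplied string
filename = "notes.txt"

path = posixpath.join(directory, filename)  # str in, str out
file_url = "file://" + quote(path)          # still str

# Bytes appear only when the URI is put on the wire.
wire_bytes = file_url.encode("ascii")
```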

Text editors and viewers will also use str for their buffers, and if
they provide a way to fish out URIs for their users, they'll probably
return str.

I won't pretend to judge the relative importance of such use cases.
But use cases for urllib which naturally favor str until you put the
URI on the wire do exist, as does the debugging presentation aspect.
