Antoine Pitrou [EMAIL PROTECTED] added the comment:
There's an unquote()-related failure in #3613.
___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___
Matt Giuca [EMAIL PROTECTED] added the comment:
Thanks for pointing that out, Antoine. I just commented on that bug.
Matt Giuca [EMAIL PROTECTED] added the comment:
Hi,
Sorry to bump this, but you (Guido) said you wanted this closed by
Wednesday. Is this patch committable yet? (There are no more unresolved
issues that I am aware of).
Guido van Rossum [EMAIL PROTECTED] added the comment:
Looking into this now. Will make sure it's included in beta3.
--
assignee: -> gvanrossum
priority: -> release blocker
Guido van Rossum [EMAIL PROTECTED] added the comment:
Checked in patch 10 with minor style changes as r65838.
Thanks Matt for persevering! Thanks everyone else for contributing;
this has been quite educational.
--
resolution: -> accepted
status: open -> closed
Matt Giuca [EMAIL PROTECTED] added the comment:
Ah cheers Antoine, for the tip on using defaultdict (I was confused as
to how I could access the key just by passing defaultfactory, as the
manual suggests).
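The tip can be illustrated with a minimal sketch. It assumes the idea is to subclass defaultdict and override __missing__, which receives the missing key (default_factory is called with no arguments, hence the confusion). The class name QuotingCache and the percent-encoding logic are hypothetical and are not taken from the patch.

```python
from collections import defaultdict


class QuotingCache(defaultdict):
    """Hypothetical cache mapping a character to its percent-encoded form."""

    def __missing__(self, key):
        # Unlike default_factory, __missing__ is passed the missing key,
        # so the value can be computed from it.
        value = "".join("%{:02X}".format(b) for b in key.encode("utf-8"))
        self[key] = value  # store it so later lookups skip __missing__
        return value


cache = QuotingCache()
print(cache["\u00e9"])     # computed on first access
print("\u00e9" in cache)   # the key is now stored
```

Later lookups of the same key are plain dictionary hits, which is where any speed benefit would come from.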
Matt Giuca [EMAIL PROTECTED] added the comment:
OK I implemented the defaultdict solution. I got curious so ran some
rough speed tests, using the following code.
import random, urllib.parse
for i in range(0, 10):
    str = ''.join(chr(random.randint(0, 0x10)) for _ in range(50))
Antoine Pitrou [EMAIL PROTECTED] added the comment:
Hello Matt,
OK I implemented the defaultdict solution. I got curious so ran some
rough speed tests, using the following code.
import random, urllib.parse
for i in range(0, 10):
    str = ''.join(chr(random.randint(0, 0x10)) for _
Matt Giuca [EMAIL PROTECTED] added the comment:
New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).
Another update on the remaining outstanding issues:
Resolved issues since last time:
Should unquote accept a bytes/bytearray as well as a str?
No. But
Matt Giuca [EMAIL PROTECTED] added the comment:
Antoine:
I think if you move the line defining str out of the loop, relative
timings should change quite a bit. Chances are that the random
functions are not very fast, since they are written in pure Python.
Well I wanted to test throwing lots
Matt Giuca [EMAIL PROTECTED] added the comment:
I have no strong opinion on the very remaining points you listed,
except that IMHO encode_rfc2231 with charset=None should not try to
use UTF8 by default. But someone with more mail protocol skills
should comment :)
OK I've come to the
Changes by Marc-Andre Lemburg [EMAIL PROTECTED]:
--
nosy: -lemburg
Guido van Rossum [EMAIL PROTECTED] added the comment:
On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca [EMAIL PROTECTED] wrote:
I have no strong opinion on the very remaining points you listed,
except that IMHO encode_rfc2231 with charset=None should not try to
use UTF8 by default. But someone
Matt Giuca [EMAIL PROTECTED] added the comment:
I'm OK with replace for unquote() ...
For quote() I think strict is better
There's just an odd inconsistency there, but it's only a tiny gotcha;
and I agree with all your other arguments. I'll change unquote back to
errors='replace'.
This
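The asymmetry described above (lenient decoding, strict encoding) can be sketched as follows. This assumes the defaults of today's urllib.parse (errors='replace' for unquote, errors='strict' for quote), which may differ in detail from the patch under discussion. The helper names are hypothetical.

```python
from urllib.parse import quote, unquote

# %FF can never appear in valid UTF-8, so it exercises the error handler.
lenient = unquote("a%FFb")  # default errors='replace' substitutes U+FFFD


def strict_unquote_fails(text):
    """Return True if decoding text as strict UTF-8 raises an error."""
    try:
        unquote(text, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


def strict_quote_fails(text, encoding):
    """Return True if text cannot be encoded with the given encoding."""
    try:
        quote(text, encoding=encoding)  # quote() defaults to errors='strict'
    except UnicodeEncodeError:
        return True
    return False


print(repr(lenient))
print(strict_unquote_fails("a%FFb"))
print(strict_quote_fails("num\u00e9ro", "ascii"))
```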
Antoine Pitrou [EMAIL PROTECTED] added the comment:
Quoting Matt Giuca [EMAIL PROTECTED]:
Now that you've spent so much time with this patch, can't you think
of a faster way of doing this?
Well firstly, you could replace Quoter (the class) with a quoter
function, which is nested inside
Antoine Pitrou [EMAIL PROTECTED] added the comment:
Quoting Antoine Pitrou [EMAIL PROTECTED]:
As for the defaultdict, here is what it can look like (this is on 2.5):
(there should be a line here saying class D(defaultdict) :-))
... def __missing__(self, key):
...     print "__missing__", key
Guido van Rossum [EMAIL PROTECTED] added the comment:
Now that you've spent so much time with this patch, can't you think
of a faster way of doing this?
Well firstly, you could replace Quoter (the class) with a quoter
function, which is nested inside quote. Would calling a nested function
Bill Janssen [EMAIL PROTECTED] added the comment:
Feel free to take the function implementation from my patch, if it speeds
things up (and it should).
Bill
On Wed, Aug 13, 2008 at 9:41 AM, Guido van Rossum [EMAIL PROTECTED] wrote:
Guido van Rossum [EMAIL PROTECTED] added the comment:
Now
Bill Janssen [EMAIL PROTECTED] added the comment:
Erik van der Poel at Google has now chimed in with stats on current URL
usage:
``...the bottom line is that escaped non-utf-8 is still quite prevalent,
enough (in my opinion) to require an implementation in Python, possibly
even allowing for
Guido van Rossum [EMAIL PROTECTED] added the comment:
Bill Janssen [EMAIL PROTECTED] added the comment:
Erik van der Poel at Google has now chimed in with stats on current URL
usage:
``...the bottom line is that escaped non-utf-8 is still quite prevalent,
enough (in my opinion) to require
Antoine Pitrou [EMAIL PROTECTED] added the comment:
On Wednesday 13 August 2008 at 17:05 +, Bill Janssen wrote:
I think it's worth remembering that a very large proportion of the use
of Python's urllib.unquote() is in implementations of Web server
frameworks of one sort or another. We
Bill Janssen [EMAIL PROTECTED] added the comment:
On Wed, Aug 13, 2008 at 10:51 AM, Antoine Pitrou [EMAIL PROTECTED] wrote:
Antoine Pitrou [EMAIL PROTECTED] added the comment:
On Wednesday 13 August 2008 at 17:05 +, Bill Janssen wrote:
I think it's worth remembering that a very large
Changes by Guido van Rossum [EMAIL PROTECTED]:
Removed file: http://bugs.python.org/file11107/unnamed
Changes by Guido van Rossum [EMAIL PROTECTED]:
Removed file: http://bugs.python.org/file11106/unnamed
Matt Giuca [EMAIL PROTECTED] added the comment:
Bill, this debate is getting snippy, and going nowhere. We could argue
about what is the pure and correct thing to do, but we have a
limited time frame here, so I suggest we just look at the important facts.
1. There is an overwhelming consensus
Matt Giuca [EMAIL PROTECTED] added the comment:
By the way, what is the current status of this bug? Is anybody waiting
on me to do anything? (Re: Patch 9)
To recap my previous list of outstanding issues raised by the review:
Should unquote accept a bytes/bytearray as well as a str?
Currently,
Antoine Pitrou [EMAIL PROTECTED] added the comment:
I agree that given two similar patches, the one with more tests earns
some bonus points. Also, it seems to me that round-trippability of
quote()/unquote() is a logical and semantic requirement: in particular,
if there is a default encoding, it
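The round-trip requirement mentioned above can be checked with a short sketch. It assumes the encoding and errors keyword arguments of today's urllib.parse. The helper name round_trips and the sample strings are hypothetical.

```python
from urllib.parse import quote, unquote


def round_trips(text, encoding="utf-8"):
    """Return True if unquote(quote(text)) gives back text for this encoding."""
    quoted = quote(text, encoding=encoding)
    # errors='strict' so that a mismatch raises instead of being masked
    return unquote(quoted, encoding=encoding, errors="strict") == text


samples = ["num\u00e9ro", "a b/c?d=e&f", "\u65e5\u672c\u8a9e", "100%"]
print(all(round_trips(s) for s in samples))
print(round_trips("num\u00e9ro", encoding="latin-1"))
```

The round trip holds only when both calls use the same encoding, which is why a shared default matters.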
Bill Janssen [EMAIL PROTECTED] added the comment:
Larry Masinter is off on vacation, but I did get a brief message saying
that he will dig up similar discussions that he was involved in when he
gets back.
Out of curiosity, I sent a note off to the www-international mailing
list, and received
Bill Janssen [EMAIL PROTECTED] added the comment:
For Antoine:
I think the problem that Barry is facing with the email package is that
Unicode strings are an ambiguous representation of a sequence of bytes;
that is, there are a number of different byte sequences a Unicode string
may have come
Bill Janssen [EMAIL PROTECTED] added the comment:
Here's another thought:
Let's put string_to_bytes and string_from_bytes into the binascii
module, as a2b_percent and b2a_percent, respectively.
Then parse.py would import them as
from binascii import a2b_percent as percent_decode_as_bytes
Antoine Pitrou [EMAIL PROTECTED] added the comment:
On Tuesday 12 August 2008 at 19:37 +, Bill Janssen wrote:
Let's put string_to_bytes and string_from_bytes into the binascii
module, as a2b_percent and b2a_percent, respectively.
Well, it's my personal opinion, but I think we should focus
Guido van Rossum [EMAIL PROTECTED] added the comment:
Matt Giuca [EMAIL PROTECTED] added the comment:
By the way, what is the current status of this bug? Is anybody waiting
on me to do anything? (Re: Patch 9)
I'll be reviewing it today or tomorrow. From looking at it briefly I
worry that the
Bill Janssen [EMAIL PROTECTED] added the comment:
On Sat, Aug 9, 2008 at 11:34 AM, Matt Giuca [EMAIL PROTECTED] wrote:
Matt Giuca [EMAIL PROTECTED] added the comment:
Bill, I had a look at your patch. I see you've decided to make
quote_as_string the default? In that case, I don't know why
Changes by Bill Janssen [EMAIL PROTECTED]:
Removed file: http://bugs.python.org/file11101/unnamed
Bill Janssen [EMAIL PROTECTED] added the comment:
Some interesting notes here (from Erik van der Poel at Google; Guido,
you might want to stroll over to his location and talk with him):
http://lists.w3.org/Archives/Public/www-international/2007JanMar/0004.html
and more particularly
Matt Giuca [EMAIL PROTECTED] added the comment:
Guido suggested that quote's safe parameter should allow any
character, not just ASCII range. I've implemented this now. It was a lot
messier than I imagined.
The problem is that in my older patches, both 's' and 'safe' are encoded
to bytes right
Matt Giuca [EMAIL PROTECTED] added the comment:
Made a bunch of requested changes (I've reverted the all safe patch
for now since it caused so much grief; see above).
* quote: Fixed encoding illegal % sequences (and lots of new test cases
to prove it).
* quote now throws a type error if s is
Antoine Pitrou [EMAIL PROTECTED] added the comment:
On Sunday 10 August 2008 at 07:05 +, Matt Giuca wrote:
I don't think it's worth the extra code bloat and performance hit just
to implement a feature whose only use is producing invalid URIs (since
URIs are supposed to only have ASCII
Matt Giuca [EMAIL PROTECTED] added the comment:
Invalid user input? What if the query string comes from filling
a form?
For example if I search the word numéro in a latin1 Web site,
I get the following URL:
http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro
Yes, that is a
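The Latin-1 case quoted above can be reproduced with a small sketch. It assumes the encoding argument of today's urllib.parse.unquote and quote, and it uses only the query value from the example URL.

```python
from urllib.parse import quote, unquote

query_value = "num%E9ro"  # as produced by the Latin-1 site in the example

# Decoding with the encoding the site actually used recovers the word.
as_latin1 = unquote(query_value, encoding="latin-1")

# Decoding with the UTF-8 default cannot interpret the lone 0xE9 byte,
# so errors='replace' substitutes U+FFFD.
as_utf8 = unquote(query_value)

# Encoding the word back with Latin-1 gives the original escape.
requoted = quote("num\u00e9ro", encoding="latin-1")

print(as_latin1, repr(as_utf8), requoted)
```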
Matt Giuca [EMAIL PROTECTED] added the comment:
Bill, I had a look at your patch. I see you've decided to make
quote_as_string the default? In that case, I don't know why you had to
rewrite everything to implement the same basic behaviour as my patch.
(My latest few patches support bytes both
Jim Jewett [EMAIL PROTECTED] added the comment:
Matt,
Bill's main concern is with a policy decision; I doubt he would object to
using your code once that is resolved.
The purpose of the quoting functions is to turn a string (representing the
human-readable version) into bytes (that go over
Matt Giuca [EMAIL PROTECTED] added the comment:
Bill's main concern is with a policy decision; I doubt he would
object to using your code once that is resolved.
But his patch does the same basic operations as mine, just implemented
differently and with the heap of issues I outlined above. So
Matt Giuca [EMAIL PROTECTED] added the comment:
I've been thinking more about the errors=strict default. I think this
was Guido's suggestion. I've decided I'd rather stick with errors=replace.
I changed errors=replace to errors=strict in patch 8, but now I'm
worried that will cause problems,
Bill Janssen [EMAIL PROTECTED] added the comment:
Here's the updated version of my patch. It returns a string, but
doesn't clobber bytes that are contained in the string. Naive code
(basically code that expects ASCII strings from unquote) should continue
to work as well as it ever did. I
Changes by Bill Janssen [EMAIL PROTECTED]:
Removed file: http://bugs.python.org/file11064/patch
Matt Giuca [EMAIL PROTECTED] added the comment:
Dear GvR,
New code review comments by mgiuca have been published.
Please go to http://codereview.appspot.com/2827 to read them.
Message:
Hi Guido,
Thanks very much for this very detailed review. I've replied to the
comments. I will make the
Matt Giuca [EMAIL PROTECTED] added the comment:
A reply to a point on GvR's review, I'd like to open for discussion.
This relates to whether or not quote's safe argument should allow
non-ASCII characters.
Using errors='ignore' seems like a mistake -- it will hide errors. I
also wonder why
Antoine Pitrou [EMAIL PROTECTED] added the comment:
On Thursday 07 August 2008 at 13:42 +, Matt Giuca wrote:
The reasoning is this: if we allow non-ASCII characters to be escaped,
then we allow quote to generate invalid URIs (URIs are only allowed to
have ASCII characters). It's one thing
Matt Giuca [EMAIL PROTECTED] added the comment:
The important thing is that the defaults are safe. If users want to override
the defaults and produce potentially invalid URIs, there is no reason to
discourage them.
OK I think that's a fairly valid argument. I'm about to head off so I'll
post the
Matt Giuca [EMAIL PROTECTED] added the comment:
Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all
Matt Giuca [EMAIL PROTECTED] added the comment:
I'm also attaching a metapatch - diff from patch 7 to patch 8. This is
to give a rough idea of what I changed since the review.
(Sorry - This is actually a diff between the two patches, so it's pretty
hard to read. It would have been nicer to diff
Guido van Rossum [EMAIL PROTECTED] added the comment:
On Thu, Aug 7, 2008 at 8:03 AM, Matt Giuca [EMAIL PROTECTED] wrote:
Matt Giuca [EMAIL PROTECTED] added the comment:
I'm also attaching a metapatch - diff from patch 7 to patch 8. This is
to give a rough idea of what I changed since the
Bill Janssen [EMAIL PROTECTED] added the comment:
Just to reply to Antoine's comments on my patch:
- it would be nice to have more unit tests, especially for the various
bytes/unicode possibilities, and perhaps also roundtripping (Matt's
patch has a lot of tests)
Yes, I completely agree.
-
Bill Janssen [EMAIL PROTECTED] added the comment:
My main fear with this patch is that unquote will become seen as
unreliable, because naive software trying to parse URLs will encounter
uses of percent-encoding where the encoded octets are not in fact UTF-8
bytes. They're just some set of
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment:
On 2008-08-07 23:17, Bill Janssen wrote:
Bill Janssen [EMAIL PROTECTED] added the comment:
My main fear with this patch is that unquote will become seen as
unreliable, because naive software trying to parse URLs will encounter
uses of
Guido van Rossum [EMAIL PROTECTED] added the comment:
On Thu, Aug 7, 2008 at 2:17 PM, Bill Janssen [EMAIL PROTECTED] wrote:
Bill Janssen [EMAIL PROTECTED] added the comment:
My main fear with this patch is that unquote will become seen as
unreliable, because naive software trying to parse
Bill Janssen [EMAIL PROTECTED] added the comment:
Your original proposal was to make unquote() behave like
unquote_to_bytes(), which would require changes to virtually every app
using unquote(), since almost all apps assume the result is a (text)
string.
Actually, careful apps realize that
Guido van Rossum [EMAIL PROTECTED] added the comment:
On Thu, Aug 7, 2008 at 3:58 PM, Bill Janssen [EMAIL PROTECTED] wrote:
Bill Janssen [EMAIL PROTECTED] added the comment:
Your original proposal was to make unquote() behave like
unquote_to_bytes(), which would require changes to virtually
Bill Janssen [EMAIL PROTECTED] added the comment:
On Thu, Aug 7, 2008 at 4:23 PM, Guido van Rossum [EMAIL PROTECTED] wrote:
However I fear that this middle ground will in practice cause:
(a) more in-the-field failures, since devs are notorious for testing
with ASCII only; and
Bill Janssen [EMAIL PROTECTED] added the comment:
Now I'm looking at the failing test_http_cookiejar test, which fails
because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI.
The question is, does the http URI scheme allow non-ASCII (say,
Latin-1) octets in path segments? IANA
Changes by Bill Janssen [EMAIL PROTECTED]:
Removed file: http://bugs.python.org/file11078/unnamed
Bill Janssen [EMAIL PROTECTED] added the comment:
Looks like the failing test in test_http_cookiejar is just a bad test;
it attempts to build an HTTP request object from an invalid URL, yet
still seems to expect to be able to extract a cookie from the response
headers for that request. I'd
Bill Janssen [EMAIL PROTECTED] added the comment:
Just to be clear: any octet would seem to be allowed in the path of
an http URL, but any non-ASCII octet must be percent-encoded. So the
URL itself is still an ASCII string, considered opaquely.
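A minimal sketch of that point, assuming today's urllib.parse.quote: whatever encoding is chosen for the non-ASCII characters, the quoted result is plain ASCII. The sample path is hypothetical.

```python
from urllib.parse import quote

path = "/caf\u00e9/\u00e5"  # hypothetical path with non-ASCII characters

quoted_utf8 = quote(path)                         # octets taken from UTF-8
quoted_latin1 = quote(path, encoding="latin-1")   # octets taken from Latin-1

# Both results encode to ASCII without error, whichever octets were used.
for quoted in (quoted_utf8, quoted_latin1):
    print(quoted, quoted.encode("ascii"))
```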
Bill Janssen [EMAIL PROTECTED] added the comment:
Here's my version of how quote and unquote should be implemented in
Python 3.0. I haven't looked at the uses of it in the library, but I'd
expect improper uses (and there are lots of them) will break, and thus
can be fixed.
Basically,
Changes by Bill Janssen [EMAIL PROTECTED]:
Removed file: http://bugs.python.org/file11062/myunquote.py
Bill Janssen [EMAIL PROTECTED] added the comment:
Here's a patch to parse.py (and test/test_urllib.py) that makes the
various tests (cgi, urllib, httplib) pass. It basically adds
unquote_as_string, unquote_as_bytes, quote_as_string,
quote_as_bytes, and then defines the existing quote and unquote
Jim Jewett [EMAIL PROTECTED] added the comment:
Is there still disagreement over anything except:
(1) The type signature of quote and unquote (as opposed to the
explicit quote_as_bytes or quote_as_string).
(2) The default encoding (latin-1 vs UTF8), and (if UTF-8) what to do
with invalid
Antoine Pitrou [EMAIL PROTECTED] added the comment:
Bill, I haven't studied your patch in detail but a few comments:
- it would be nice to have more unit tests, especially for the various
bytes/unicode possibilities, and perhaps also roundtripping (Matt's
patch has a lot of tests)
-
Guido van Rossum [EMAIL PROTECTED] added the comment:
Bill Janssen's patch breaks two unittests: test_email and
test_http_cookiejar. Details for test_email:
==
ERROR: test_rfc2231_bad_character_in_filename
Jim Jewett [EMAIL PROTECTED] added the comment:
Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I
assume Bill could patch his patch the same way Matt did, and this would
resolve the email tests. (Unless you pronounce to stick with Latin-1)
The cookiejar failure
Guido van Rossum [EMAIL PROTECTED] added the comment:
Dear GvR,
New code review comments by GvR have been published.
Please go to http://codereview.appspot.com/2827 to read them.
Message:
Hi Matt,
Here's a code review of your patch.
I'm leaning more and more towards wanting this for 3.0, but
Jim Jewett [EMAIL PROTECTED] added the comment:
http://codereview.appspot.com/2827/diff/1/5#newcode1450
Line 1450: %3c%3c%0Anew%C3%A5/%C3%A5,
I'm guessing this test broke otherwise?
Yes; that is one of the breakages you found in Bill's patch. (He didn't
modify the test.)
Given that
Matt Giuca [EMAIL PROTECTED] added the comment:
OK after a long discussion on the mailing list, Guido gave this the OK,
with the provision that there are str->bytes and bytes->str versions of
these functions as well. So I've written those.
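As a rough illustration of bytes-oriented variants, the sketch below uses quote_from_bytes and unquote_to_bytes as they exist in today's urllib.parse. Whether these match the functions written for this patch in name and behaviour is an assumption.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = b"\xe5/\xff"  # arbitrary octets, not valid UTF-8

# bytes -> str: every non-safe octet becomes a %XX escape ('/' is safe).
quoted = quote_from_bytes(raw)

# str -> bytes: escapes become raw octets again, with no decoding step.
restored = unquote_to_bytes(quoted)

print(quoted, restored)
```

Because no text encoding is involved, this pair round-trips any byte sequence, including ones that are not valid in any particular character encoding.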
Matt Giuca [EMAIL PROTECTED] added the comment:
Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's
minor, gives a harmless BytesWarning if you run with -b, which make
test does, so I only picked it up after submitting).
I've slightly changed the code in quote so it doesn't
Matt Giuca [EMAIL PROTECTED] added the comment:
OK I spent a while writing test cases for quote and unquote, encoding and
decoding various Unicode strings with different encodings. As a result,
I found a bunch of issues in my previous patch, so I've rewritten the
patches to both quote and
Matt Giuca [EMAIL PROTECTED] added the comment:
So today I grepped for urllib in the entire library in an effort to
track down every dependency on quote and unquote to see exactly how my
patch breaks other code. I've now investigated every module in the
library which uses quote, unquote or
Matt Giuca [EMAIL PROTECTED] added the comment:
3.0b1 has been released, so no new features can be added to 3.0.
While my proposal is no doubt going to cause a lot of code breakage, I
hardly consider it a new feature. This is very definitely a bug. As I
understand it, the point of a code
Matt Giuca [EMAIL PROTECTED] added the comment:
Since I got a complaint that my last reply was too long, I'll summarize it.
It's a bug report, not a feature request.
I can't get a simple web app to be properly Unicode-aware in Python 3,
which worked fine in Python 2. This cannot be put off
Matt Giuca [EMAIL PROTECTED] added the comment:
Setting Version back to Python 3.0. Is there a reason it was set to
Python 3.1? This proposal will certainly break a lot of code. It's *far*
better to do it in the big backwards-incompatible Python 3.0 release
than a later release.
--
Martin v. Löwis [EMAIL PROTECTED] added the comment:
Setting Version back to Python 3.0. Is there a reason it was set to
Python 3.1?
3.0b1 has been released, so no new features can be added to 3.0.
Tom Pinckney [EMAIL PROTECTED] added the comment:
I mentioned this in a brief python-dev discussion earlier this
spring, but many popular websites such as Wikipedia and Facebook do use
UTF-8 as their character encoding scheme for the path and argument
portion of URLs.
I know there's no
Matt Giuca [EMAIL PROTECTED] added the comment:
OK I've gone back over the patch and decided to add the encoding and
errors arguments from the str.encode/decode methods as optional
arguments to quote and unquote. This is a much bigger change than I
originally intended, but I think it makes
Changes by Martin v. Löwis [EMAIL PROTECTED]:
--
versions: +Python 3.1 -Python 3.0
Martin v. Löwis [EMAIL PROTECTED] added the comment:
Assuming the patch is acceptable in the first place (on which I personally
have not made up my mind), then it lacks documentation and test suite
changes.
Matt Giuca [EMAIL PROTECTED] added the comment:
OK well here are the necessary changes to the documentation (RST docs
and docstrings in the code).
As I said above, I plan to do extensive testing and add new cases, and I
don't recommend this patch is accepted until that's done.
Patch
Changes by Senthil [EMAIL PROTECTED]:
--
nosy: +orsenthil
New submission from Matt Giuca [EMAIL PROTECTED]:
Three Unicode-related problems with urllib.parse.quote and
urllib.parse.unquote in Python 3.0. (Patch attached).
Firstly, unquote appears not to have been modified from Python 2, where
it is designed to output a byte string. In Python 3, it
Martin v. Löwis [EMAIL PROTECTED] added the comment:
RFC 3986 states that the percent-encoded byte
values should be decoded as UTF-8.
Where precisely do you read such a SHOULD requirement?
Section 2.5 elaborates that the local encoding (of the
resource) is typically used, ignoring cases where
Matt Giuca [EMAIL PROTECTED] added the comment:
Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be
used. Since we're outputting a Unicode string in Python 3, we need to
decode with some encoding, and UTF-8 seems the most sensible and
standardised.
(Even the existing test