[issue3300] urllib.quote and unquote - Unicode issues

2008-08-20 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: There's an unquote()-related failure in #3613. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3300 ___ ___

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-20 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Thanks for pointing that out, Antoine. I just commented on that bug.

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-18 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Hi, Sorry to bump this, but you (Guido) said you wanted this closed by Wednesday. Is this patch committable yet? (There are no more unresolved issues that I am aware of).

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-18 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Looking into this now. Will make sure it's included in beta3. -- assignee: -> gvanrossum priority: -> release blocker

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-18 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Checked in patch 10 with minor style changes as r65838. Thanks Matt for persevering! Thanks everyone else for contributing; this has been quite educational. -- resolution: -> accepted status: open -> closed

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Ah cheers Antoine, for the tip on using defaultdict (I was confused as to how I could access the key just by passing defaultfactory, as the manual suggests).

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: OK I implemented the defaultdict solution. I got curious so ran some rough speed tests, using the following code. import random, urllib.parse for i in range(0, 10): str = ''.join(chr(random.randint(0, 0x10)) for _ in range(50))
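The snippet above is cut off, so the exact benchmark is not recoverable. As a rough sketch only, a timing harness in the same spirit might look like the following. The helper names `random_text` and `time_quote`, the use of `timeit`, and the repeat count are assumptions for illustration, not the code from the original message. The random strings are built before timing starts so that the cost of `random` is not included in the measurement.

```python
import random
import timeit
import urllib.parse


def random_text(length=50, max_codepoint=0x10):
    """Build a random string (hypothetical helper; bounds copied from the snippet)."""
    return "".join(chr(random.randint(0, max_codepoint)) for _ in range(length))


def time_quote(samples, repeat=100):
    """Return the seconds taken to quote every sample `repeat` times."""
    def run():
        for s in samples:
            urllib.parse.quote(s)
    return timeit.timeit(run, number=repeat)


# Generate the inputs once, outside the timed region.
samples = [random_text() for _ in range(10)]
elapsed = time_quote(samples)
print(f"quote() over {len(samples)} samples: {elapsed:.4f}s")
```

Absolute numbers from a harness like this depend heavily on the machine and Python version, so only relative comparisons between implementations are meaningful.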

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: Hello Matt, OK I implemented the defaultdict solution. I got curious so ran some rough speed tests, using the following code. import random, urllib.parse for i in range(0, 10): str = ''.join(chr(random.randint(0, 0x10)) for _

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: New patch (patch10). Details on Rietveld review tracker (http://codereview.appspot.com/2827). Another update on the remaining outstanding issues: Resolved issues since last time: Should unquote accept a bytes/bytearray as well as a str? No. But

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Antoine: I think if you move the line defining str out of the loop, relative timings should change quite a bit. Chances are that the random functions are not very fast, since they are written in pure Python. Well I wanted to test throwing lots

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: I have no strong opinion on the very remaining points you listed, except that IMHO encode_rfc2231 with charset=None should not try to use UTF8 by default. But someone with more mail protocol skills should comment :) OK I've come to the

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Marc-Andre Lemburg
Changes by Marc-Andre Lemburg [EMAIL PROTECTED]: -- nosy: -lemburg

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca [EMAIL PROTECTED] wrote: I have no strong opinion on the very remaining points you listed, except that IMHO encode_rfc2231 with charset=None should not try to use UTF8 by default. But someone

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: I'm OK with replace for unquote() ... For quote() I think strict is better There's just an odd inconsistency there, but it's only a tiny gotcha; and I agree with all your other arguments. I'll change unquote back to errors='replace'. This

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: Quoting Matt Giuca [EMAIL PROTECTED]: Now that you've spent so much time with this patch, can't you think of a faster way of doing this? Well firstly, you could replace Quoter (the class) with a quoter function, which is nested inside

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: Quoting Antoine Pitrou [EMAIL PROTECTED]: As for the defaultdict, here is how it can look (this is on 2.5): (there should be a line here saying class D(defaultdict) :-)) ... def __missing__(self, key): ... print __missing__, key
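The idea in the message above is that a `defaultdict` subclass can override `__missing__` to see the key that was looked up, which a plain `default_factory` callable cannot. A minimal Python 3 sketch of that technique follows. The class name `KeyAwareDefault` and the percent-escape formatting are illustrative assumptions, not the code from the patch.

```python
from collections import defaultdict


class KeyAwareDefault(defaultdict):
    """A defaultdict whose default value is computed from the missing key."""

    def __missing__(self, key):
        # Unlike default_factory, __missing__ receives the key itself.
        value = "%{:02X}".format(key)
        self[key] = value  # cache so later lookups skip __missing__
        return value


table = KeyAwareDefault()
first = table[0x20]   # computed by __missing__ and stored
second = table[0x20]  # served from the dict directly
print(first, second, len(table))
```

Caching the computed value in the dict is what makes this attractive for a quoting table: each distinct key pays the Python-level call only once.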

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Now that you've spent so much time with this patch, can't you think of a faster way of doing this? Well firstly, you could replace Quoter (the class) with a quoter function, which is nested inside quote. Would calling a nested function

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Feel free to take the function implementation from my patch, if it speeds things up (and it should). Bill On Wed, Aug 13, 2008 at 9:41 AM, Guido van Rossum [EMAIL PROTECTED] wrote: Guido van Rossum [EMAIL PROTECTED] added the comment: Now

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Erik van der Poel at Google has now chimed in with stats on current URL usage: ``...the bottom line is that escaped non-utf-8 is still quite prevalent, enough (in my opinion) to require an implementation in Python, possibly even allowing for

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Bill Janssen [EMAIL PROTECTED] added the comment: Erik van der Poel at Google has now chimed in with stats on current URL usage: ``...the bottom line is that escaped non-utf-8 is still quite prevalent, enough (in my opinion) to require

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: On Wednesday 13 August 2008 at 17:05 +, Bill Janssen wrote: I think it's worth remembering that a very large proportion of the use of Python's urllib.unquote() is in implementations of Web server frameworks of one sort or another. We

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: On Wed, Aug 13, 2008 at 10:51 AM, Antoine Pitrou [EMAIL PROTECTED] wrote: Antoine Pitrou [EMAIL PROTECTED] added the comment: On Wednesday 13 August 2008 at 17:05 +, Bill Janssen wrote: I think it's worth remembering that a very large

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Guido van Rossum
Changes by Guido van Rossum [EMAIL PROTECTED]: Removed file: http://bugs.python.org/file11107/unnamed

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Guido van Rossum
Changes by Guido van Rossum [EMAIL PROTECTED]: Removed file: http://bugs.python.org/file11106/unnamed

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Bill, this debate is getting snippy, and going nowhere. We could argue about what is the pure and correct thing to do, but we have a limited time frame here, so I suggest we just look at the important facts. 1. There is an overwhelming consensus

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: By the way, what is the current status of this bug? Is anybody waiting on me to do anything? (Re: Patch 9) To recap my previous list of outstanding issues raised by the review: Should unquote accept a bytes/bytearray as well as a str? Currently,

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: I agree that given two similar patches, the one with more tests earns some bonus points. Also, it seems to me that round-trippability of quote()/unquote() is a logical and semantic requirement: in particular, if there is a default encoding, it
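The round-trip requirement described above can be checked directly against the `urllib.parse` functions as they exist in released Python 3. The sketch below is only an illustration: the helper name `round_trips` and the sample strings are assumptions. It shows that quoting and unquoting with the same encoding restores the input, while mixing encodings does not.

```python
from urllib.parse import quote, unquote


def round_trips(text, encoding="utf-8"):
    """Return True if unquote(quote(text)) restores text under one encoding."""
    quoted = quote(text, encoding=encoding)
    return unquote(quoted, encoding=encoding) == text


samples = ["plain", "a b/c", "numéro", "日本語", "100%"]
default_ok = all(round_trips(s) for s in samples)

# Quoting as Latin-1 but unquoting with the UTF-8 default breaks the round trip.
latin1_quoted = quote("numéro", encoding="latin-1")
mismatch_ok = unquote(latin1_quoted) == "numéro"

print(default_ok, latin1_quoted, mismatch_ok)
```

This is why the two functions need to agree on a default encoding: a caller who passes no encoding to either one should get the original string back.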

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Larry Masinter is off on vacation, but I did get a brief message saying that he will dig up similar discussions that he was involved in when he gets back. Out of curiosity, I sent a note off to the www-international mailing list, and received

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: For Antoine: I think the problem that Barry is facing with the email package is that Unicode strings are an ambiguous representation of a sequence of bytes; that is, there are a number of different byte sequences a Unicode string may have come

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Here's another thought: Let's put string_to_bytes and string_from_bytes into the binascii module, as a2b_percent and b2a_percent, respectively. Then parse.py would import them as from binascii import a2b_percent as percent_decode_as_bytes

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: On Tuesday 12 August 2008 at 19:37 +, Bill Janssen wrote: Let's put string_to_bytes and string_from_bytes into the binascii module, as a2b_percent and b2a_percent, respectively. Well, it's my personal opinion, but I think we should focus

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Matt Giuca [EMAIL PROTECTED] added the comment: By the way, what is the current status of this bug? Is anybody waiting on me to do anything? (Re: Patch 9) I'll be reviewing it today or tomorrow. From looking at it briefly I worry that the

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-11 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: On Sat, Aug 9, 2008 at 11:34 AM, Matt Giuca [EMAIL PROTECTED] wrote: Matt Giuca [EMAIL PROTECTED] added the comment: Bill, I had a look at your patch. I see you've decided to make quote_as_string the default? In that case, I don't know why

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-11 Thread Bill Janssen
Changes by Bill Janssen [EMAIL PROTECTED]: Removed file: http://bugs.python.org/file11101/unnamed

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-11 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Some interesting notes here (from Erik van der Poel at Google; Guido, you might want to stroll over to his location and talk with him): http://lists.w3.org/Archives/Public/www-international/2007JanMar/0004.html and more particularly

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Guido suggested that quote's safe parameter should allow any character, not just ASCII range. I've implemented this now. It was a lot messier than I imagined. The problem is that in my older patches, both 's' and 'safe' are encoded to bytes right

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Made a bunch of requested changes (I've reverted the all safe patch for now since it caused so much grief; see above). * quote: Fixed encoding illegal % sequences (and lots of new test cases to prove it). * quote now throws a type error if s is

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: On Sunday 10 August 2008 at 07:05 +, Matt Giuca wrote: I don't think it's worth the extra code bloat and performance hit just to implement a feature whose only use is producing invalid URIs (since URIs are supposed to only have ASCII

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Invalid user input? What if the query string comes from filling a form? For example if I search the word numéro in a latin1 Web site, I get the following URL: http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro Yes, that is a
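To make the Latin-1 form example concrete, here is a small sketch using `parse_qs` from released Python 3, which accepts an `encoding` argument. Only the `recherche=num%E9ro` fragment of the URL above is used, and the variable names are illustrative.

```python
from urllib.parse import parse_qs

# The escaped byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
query = "recherche=num%E9ro"

as_latin1 = parse_qs(query, encoding="latin-1")
as_utf8 = parse_qs(query)  # default: encoding="utf-8", errors="replace"

print(as_latin1)
print(as_utf8)
```

With the UTF-8 default the undecodable byte becomes U+FFFD, so a server that knows its forms are submitted as Latin-1 has to say so explicitly.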

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Bill, I had a look at your patch. I see you've decided to make quote_as_string the default? In that case, I don't know why you had to rewrite everything to implement the same basic behaviour as my patch. (My latest few patches support bytes both

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Jim Jewett
Jim Jewett [EMAIL PROTECTED] added the comment: Matt, Bill's main concern is with a policy decision; I doubt he would object to using your code once that is resolved. The purpose of the quoting functions is to turn a string (representing the human-readable version) into bytes (that go over

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Bill's main concern is with a policy decision; I doubt he would object to using your code once that is resolved. But his patch does the same basic operations as mine, just implemented differently and with the heap of issues I outlined above. So

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: I've been thinking more about the errors=strict default. I think this was Guido's suggestion. I've decided I'd rather stick with errors=replace. I changed errors=replace to errors=strict in patch 8, but now I'm worried that will cause problems,
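The trade-off between the two error modes can be seen with the `errors` parameter that `unquote` and `quote` take in released Python 3. The sketch below is an illustration under that assumption; the helper name `try_unquote` and the sample inputs are not from the thread.

```python
from urllib.parse import quote, unquote


def try_unquote(text, errors):
    """Unquote text as UTF-8, reporting a decode failure instead of raising."""
    try:
        return unquote(text, errors=errors)
    except UnicodeDecodeError:
        return None


bad = "num%E9ro"  # 0xE9 alone is not valid UTF-8

replaced = try_unquote(bad, "replace")  # lenient: substitutes U+FFFD
strict = try_unquote(bad, "strict")     # strict: decoding fails

# quote() is strict by default, so unencodable input raises.
try:
    quote("\udc80")  # a lone surrogate cannot be encoded as UTF-8
    quote_failed = False
except UnicodeEncodeError:
    quote_failed = True

print(repr(replaced), strict, quote_failed)
```

The asymmetry is deliberate in this reading: `unquote` handles untrusted input from the network, where raising on every odd byte is unhelpful, while `quote` handles the program's own data, where an encoding failure is more likely a bug worth surfacing.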

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-08 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Here's the updated version of my patch. It returns a string, but doesn't clobber bytes that are contained in the string. Naive code (basically code that expects ASCII strings from unquote) should continue to work as well as it ever did. I

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-08 Thread Bill Janssen
Changes by Bill Janssen [EMAIL PROTECTED]: Removed file: http://bugs.python.org/file11064/patch

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Dear GvR, New code review comments by mgiuca have been published. Please go to http://codereview.appspot.com/2827 to read them. Message: Hi Guido, Thanks very much for this very detailed review. I've replied to the comments. I will make the

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: A reply to a point on GvR's review, I'd like to open for discussion. This relates to whether or not quote's safe argument should allow non-ASCII characters. Using errors='ignore' seems like a mistake -- it will hide errors. I also wonder why

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: On Thursday 07 August 2008 at 13:42 +, Matt Giuca wrote: The reasoning is this: if we allow non-ASCII characters to be escaped, then we allow quote to generate invalid URIs (URIs are only allowed to have ASCII characters). It's one thing

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: The important thing is that the defaults are safe. If users want to override the defaults and produce potentially invalid URIs, there is no reason to discourage them. OK I think that's a fairly valid argument. I'm about to head off so I'll post the

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Following Guido and Antoine's reviews, I've written a new patch which fixes *most* of the issues raised. The ones I didn't fix I have noted below, and commented on the review site (http://codereview.appspot.com/2827/). Note: I intend to address all

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: I'm also attaching a metapatch - diff from patch 7 to patch 8. This is to give a rough idea of what I changed since the review. (Sorry - This is actually a diff between the two patches, so it's pretty hard to read. It would have been nicer to diff

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: On Thu, Aug 7, 2008 at 8:03 AM, Matt Giuca [EMAIL PROTECTED] wrote: Matt Giuca [EMAIL PROTECTED] added the comment: I'm also attaching a metapatch - diff from patch 7 to patch 8. This is to give a rough idea of what I changed since the

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Just to reply to Antoine's comments on my patch: - it would be nice to have more unit tests, especially for the various bytes/unicode possibilities, and perhaps also roundtripping (Matt's patch has a lot of tests) Yes, I completely agree. -

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: My main fear with this patch is that unquote will become seen as unreliable, because naive software trying to parse URLs will encounter uses of percent-encoding where the encoded octets are not in fact UTF-8 bytes. They're just some set of

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Marc-Andre Lemburg
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment: On 2008-08-07 23:17, Bill Janssen wrote: Bill Janssen [EMAIL PROTECTED] added the comment: My main fear with this patch is that unquote will become seen as unreliable, because naive software trying to parse URLs will encounter uses of

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: On Thu, Aug 7, 2008 at 2:17 PM, Bill Janssen [EMAIL PROTECTED] wrote: Bill Janssen [EMAIL PROTECTED] added the comment: My main fear with this patch is that unquote will become seen as unreliable, because naive software trying to parse

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Your original proposal was to make unquote() behave like unquote_to_bytes(), which would require changes to virtually every app using unquote(), since almost all apps assume the result is a (text) string. Actually, careful apps realize that

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: On Thu, Aug 7, 2008 at 3:58 PM, Bill Janssen [EMAIL PROTECTED] wrote: Bill Janssen [EMAIL PROTECTED] added the comment: Your original proposal was to make unquote() behave like unquote_to_bytes(), which would require changes to virtually

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: On Thu, Aug 7, 2008 at 4:23 PM, Guido van Rossum [EMAIL PROTECTED] wrote: However I fear that this middle ground will in practice cause: (a) more in-the-field failures, since devs are notorious for testing with ASCII only; and

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Now I'm looking at the failing test_http_cookiejar test, which fails because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI. The question is, does the http URI scheme allow non-ASCII (say, Latin-1) octets in path segments? IANA

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Changes by Bill Janssen [EMAIL PROTECTED]: Removed file: http://bugs.python.org/file11078/unnamed

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Looks like the failing test in test_http_cookiejar is just a bad test; it attempts to build an HTTP request object from an invalid URL, yet still seems to expect to be able to extract a cookie from the response headers for that request. I'd

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Just to be clear: any octet would seem to be allowed in the path of an http URL, but any non-ASCII octet must be percent-encoded. So the URL itself is still an ASCII string, considered opaquely.
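The point that arbitrary octets survive as long as they are percent-encoded can be sketched with the bytes-oriented functions in released Python 3, `quote_from_bytes` and `unquote_to_bytes`. The octet 0xE5 is the one from the cookiejar test discussed in this thread; the rest of the path is an illustrative assumption.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# A path containing one non-ASCII octet (0xE5) in no particular encoding.
raw_path = b"/new\xe5/"

escaped = quote_from_bytes(raw_path)  # str containing only ASCII characters
restored = unquote_to_bytes(escaped)  # the original octets, no decoding step

is_ascii = all(ord(ch) < 128 for ch in escaped)
print(escaped, restored, is_ascii)
```

Because no text encoding is involved in either direction, this pair is lossless for every byte value, which is the property the string-returning `unquote` cannot offer for non-UTF-8 input.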

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Here's my version of how quote and unquote should be implemented in Python 3.0. I haven't looked at the uses of it in the library, but I'd expect improper uses (and there are lots of them) will break, and thus can be fixed. Basically,

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Bill Janssen
Changes by Bill Janssen [EMAIL PROTECTED]: Removed file: http://bugs.python.org/file11062/myunquote.py

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Bill Janssen
Bill Janssen [EMAIL PROTECTED] added the comment: Here's a patch to parse.py (and test/test_urllib.py) that makes the various tests (cgi, urllib, httplib) pass. It basically adds unquote_as_string, unquote_as_bytes, quote_as_string, quote_as_bytes, and then defines the existing quote and unquote

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Jim Jewett
Jim Jewett [EMAIL PROTECTED] added the comment: Is there still disagreement over anything except: (1) The type signature of quote and unquote (as opposed to the explicit quote_as_bytes or quote_as_string). (2) The default encoding (latin-1 vs UTF8), and (if UTF-8) what to do with invalid

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Antoine Pitrou
Antoine Pitrou [EMAIL PROTECTED] added the comment: Bill, I haven't studied your patch in detail but a few comments: - it would be nice to have more unit tests, especially for the various bytes/unicode possibilities, and perhaps also roundtripping (Matt's patch has a lot of tests) -

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Bill Janssen's patch breaks two unittests: test_email and test_http_cookiejar. Details for test_email: == ERROR: test_rfc2231_bad_character_in_filename

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Jim Jewett
Jim Jewett [EMAIL PROTECTED] added the comment: Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I assume Bill could patch his patch the same way Matt did, and this would resolve the email tests. (Unless you pronounce to stick with Latin-1) The cookiejar failure

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Guido van Rossum
Guido van Rossum [EMAIL PROTECTED] added the comment: Dear GvR, New code review comments by GvR have been published. Please go to http://codereview.appspot.com/2827 to read them. Message: Hi Matt, Here's a code review of your patch. I'm leaning more and more towards wanting this for 3.0, but

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Jim Jewett
Jim Jewett [EMAIL PROTECTED] added the comment: http://codereview.appspot.com/2827/diff/1/5#newcode1450 Line 1450: %3c%3c%0Anew%C3%A5/%C3%A5, I'm guessing this test broke otherwise? Yes; that is one of the breakages you found in Bill's patch. (He didn't modify the test.) Given that

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: OK after a long discussion on the mailing list, Guido gave this the OK, with the provision that there are str->bytes and bytes->str versions of these functions as well. So I've written those.

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's minor, gives a harmless BytesWarning if you run with -b, which make test does, so I only picked it up after submitting). I've slightly changed the code in quote so it doesn't

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: OK I spent awhile writing test cases for quote and unquote, encoding and decoding various Unicode strings with different encodings. As a result, I found a bunch of issues in my previous patch, so I've rewritten the patches to both quote and

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: So today I grepped for urllib in the entire library in an effort to track down every dependency on quote and unquote to see exactly how my patch breaks other code. I've now investigated every module in the library which uses quote, unquote or

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-11 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: 3.0b1 has been released, so no new features can be added to 3.0. While my proposal is no doubt going to cause a lot of code breakage, I hardly consider it a new feature. This is very definitely a bug. As I understand it, the point of a code

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-11 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Since I got a complaint that my last reply was too long, I'll summarize it. It's a bug report, not a feature request. I can't get a simple web app to be properly Unicode-aware in Python 3, which worked fine in Python 2. This cannot be put off

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-10 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Setting Version back to Python 3.0. Is there a reason it was set to Python 3.1? This proposal will certainly break a lot of code. It's *far* better to do it in the big backwards-incompatible Python 3.0 release than a later release. --

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-10 Thread Martin v. Löwis
Martin v. Löwis [EMAIL PROTECTED] added the comment: Setting Version back to Python 3.0. Is there a reason it was set to Python 3.1? 3.0b1 has been released, so no new features can be added to 3.0.

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-09 Thread Tom Pinckney
Tom Pinckney [EMAIL PROTECTED] added the comment: I mentioned this in a brief python-dev discussion earlier this spring, but many popular websites such as Wikipedia and Facebook do use UTF-8 as their character encoding scheme for the path and argument portion of URLs. I know there's no

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-09 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: OK I've gone back over the patch and decided to add the encoding and errors arguments from the str.encode/decode methods as optional arguments to quote and unquote. This is a much bigger change than I originally intended, but I think it makes

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-09 Thread Martin v. Löwis
Changes by Martin v. Löwis [EMAIL PROTECTED]: -- versions: +Python 3.1 -Python 3.0

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-09 Thread Martin v. Löwis
Martin v. Löwis [EMAIL PROTECTED] added the comment: Assuming the patch is acceptable in the first place (which I personally have not made my mind up), then it lacks documentation and test suite changes.

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-09 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: OK well here are the necessary changes to the documentation (RST docs and docstrings in the code). As I said above, I plan to do extensive testing and add new cases, and I don't recommend this patch is accepted until that's done. Patch

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-08 Thread Senthil
Changes by Senthil [EMAIL PROTECTED]: -- nosy: +orsenthil

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-06 Thread Matt Giuca
New submission from Matt Giuca [EMAIL PROTECTED]: Three Unicode-related problems with urllib.parse.quote and urllib.parse.unquote in Python 3.0. (Patch attached). Firstly, unquote appears not to have been modified from Python 2, where it is designed to output a byte string. In Python 3, it

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-06 Thread Martin v. Löwis
Martin v. Löwis [EMAIL PROTECTED] added the comment: RFC 3986 states that the percent-encoded byte values should be decoded as UTF-8. Where precisely do you read such a SHOULD requirement? Section 2.5 elaborates that the local encoding (of the resource) is typically used, ignoring cases where

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-06 Thread Matt Giuca
Matt Giuca [EMAIL PROTECTED] added the comment: Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be used. Since we're outputting a Unicode string in Python 3, we need to decode with some encoding, and UTF-8 seems the most sensible and standardised. (Even the existing test