[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2017-06-04 Thread karl

karl added the comment:

#8143 was fixed.

Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.unquote(u"%CE%A3")
u'\xce\xa3'

What should become of this one?

--
nosy: +karlcow

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2013-01-04 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +serhiy.storchaka
versions:  -Python 2.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-07-17 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
nosy: +merwok, orsenthil

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

The 'unquote' function has some very strange behaviour on Unicode input. My 
proposed fix will, I am sure, be contentious, because it could change existing 
behaviour (only on unicode strings), but I think it's worth it for a sane 
unquote function.

Some historical context: I already reported this bug in Python 3 as 
http://bugs.python.org/issue3300 (or part of it). We argued for two months. I 
rewrote the function, and Guido accepted it. The same bugs are present in 
Python 2, but less urgent since they only affect 'unicode' strings (whereas in 
Python 3 they affected all strings).

PROBLEM DESCRIPTION

The basic problem is this:
Current behaviour:
 urllib.unquote(u%CE%A3)
u'\xce\xa3'
(or u'Σ')

Desired behaviour:
 urllib.unquote(u%CE%A3)
'\xce\xa3'
(Which decodes with UTF-8 to u'Σ')

Basically, if you give unquote a unicode string, it will give you back a 
unicode string, with all of the percent-escapes decoded with Latin-1. This 
behaviour was added in r39728. The following line was added:

res[i] = unichr(int(item[:2], 16)) + item[2:]

It takes a percent-escape (e.g., CE), converts it to an int (e.g., 0xCE), 
then calls unichr to form a Unicode character with that codepoint (e.g., 
u'\u00ce'). That's totally wrong. A URI percent-escape is used to represent a 
data octet [RFC 3986], not a Unicode code point.

I would argue that the result of unquote should always be a str, no matter the 
input. Since a URI represents a byte sequence, not a character sequence, 
unquote of a unicode should return a byte string, which the user can then 
decode as desired.

Note that in Python 3 we didn't have a choice, since all strings are unicode, 
we used a much more complicated solution. But we also added a function 
unquote_to_bytes. Python 2's unquote should behave like Python 3's 
unquote_to_bytes.

PROPOSED SOLUTION

To that end, my proposed solution is simply to encode the input unicode string 
with UTF-8, which is exactly what Python 3's unquote_to_bytes function does. I 
have attached a patch which does this. It is thoroughly tested and documented. 
However, see the discussion of potential problems later.

WHY THE CURRENT BEHAVIOUR IS BAD

I'll also point out that the patch in r39728 which created this problem also 
added a test case, still there, which demonstrates just how confusing this 
behaviour is:

r = urllib.unquote(u'br%C3%BCckner_sapporo_20050930.doc')
self.assertEqual(r, u'br\xc3\xbcckner_sapporo_20050930.doc')

This takes a string, clearly meant to be a UTF-8-encoded percent-escaped string 
for u'brückner_sapporo_20050930.doc', and unquotes it. Because of this bug, it 
is decoded with Latin-1 to the string 'brückner_sapporo_20050930.doc'. And 
this garbled string is *actually the expected output of the test case*!!

Furthermore, this behaviour is very confusing because it breaks equality of 
ASCII str and unicode objects. Consider:

 %C3%BC == u%C3%BC
True
 urllib.unquote(%C3%BC)
'\xc3\xbc'
 urllib.unquote(u%C3%BC)
u'\xc3\xbc'
 urllib.unquote(%C3%BC) == urllib.unquote(u%C3%BC)
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
False

Why should the ASCII str object %C3%BC encode to one value, while the ASCII 
unicode object u%C3%BC encode to another? The two inputs represent the same 
string, so they should produce the same output.

POTENTIAL PROBLEMS

The attached patch will not, to my knowledge, affect any calls to unquote with 
str input. It only changes unicode input. Since this was buggy anyway, I think 
it's a legitimate fix.

I am, however, concerned about code breakage for existing code which uses 
unicode strings and depends upon this behaviour. Some use cases:

1. Unquoting a unicode string which is pure ASCII, with pure ASCII 
percent-escapes. This previously would produce a pure ASCII unicode string, now 
produces a pure ASCII str. This shouldn't break anything unless some code 
specifically checks that strings are of type 'unicode' (e.g., the Storm 
database library).
2. Unquoting a unicode string with pure ASCII percent-escapes, but non-ASCII 
characters. This previously would preserve all the unescaped characters; they 
will now be encoded to UTF-8. Technically this should never happen, as URIs are 
not allowed to contain non-ASCII characters [RFC 3986].
3. Unquoting a unicode string which is pure ASCII, with non-ASCII percent 
escapes. Some code may rely on the implicit decoding as Latin-1. However, I 
think it is more likely that existing code would just break, since most URIs 
are UTF-8 encoded.

TWO SOLUTIONS

Having gone through the problems, I imagine that we may reach the conclusion 
that it is too dangerous to fix this bug. Therefore, I am proposing an 
alternate solution (which I will attach in a follow-up comment), which is not 
to change the code at all. Instead, just fix the broken test case and add lots 
more test cases, and also 

[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Alternative patch which fixes test cases and documentation without changing the 
behaviour.

--
Added file: http://bugs.python.org/file16542/urllib-unquote-explain.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Changes by Matt Giuca matt.gi...@gmail.com:


Removed file: http://bugs.python.org/file16542/urllib-unquote-explain.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

New version of explain patch -- fixed comment linking to the wrong bug ID -- 
now links to this bug ID (#8136).

--
Added file: http://bugs.python.org/file16545/urllib-unquote-explain.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
priority:  - normal
stage:  - patch review

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Oh, I just discovered that urlparse contains a copy of unquote, which will also 
need to be patched. I've submitted a patch to remove the duplicate (#8143) -- 
if that is accepted first then there's no need to worry about it.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8136
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com