[issue3565] array documentation, method names not 3.x-compliant

2011-07-12 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

There are still some inconsistencies in the documentation (in particular, 
incorrectly using the word "string" to refer to a bytes object, which made 
sense in Python 2 but not 3), which I fixed in my doc-only.patch file that's 
coming up to its third birthday.

Most of it has been fixed with the previous change which added 'tobytes' and 
'frombytes' and made tostring and fromstring aliases. But there are some places 
which don't make sense:

array: "If given a list or string" needs to be "If given a list, bytes or 
string" (since a bytes is not a string).
frombytes: "Appends items from the string" needs to be "Appends items from the 
bytes object", since this does not work if you give it a string.
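
A minimal Python 3 sketch of that frombytes point (just an illustration, not 
part of the patch):

from array import array

a = array('b')
a.frombytes(b'\x01\x02\x03')   # bytes input works
print(a)                       # array('b', [1, 2, 3])
try:
    a.frombytes('abc')         # str input is rejected
except TypeError as e:
    print('TypeError:', e)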

Less importantly, I also recommended renaming "unicode string" to just 
"string", since in Python 3 there is no such thing as a non-unicode string. For 
instance, there is an example that uses a variable named "unicodestring" that 
could be renamed to just "string".

> Indeed, not only it would bring little benefit, but may also confuse
> users porting from 2.x (since the from/tostring methods would then
> have a totally different meaning).
Well, by that logic, you shouldn't have renamed unicode to str since that 
would also confuse users porting from 2.x. It generally seems like a good idea 
in Python 3 to rename all mentions of "string" to "bytes" and all mentions of 
"unicode" to "string", so as to be consistent with the new names of the types 
(it is better to be internally consistent than consistent with the previous 
version).

Though I do agree that it would be chaos to rename array.from/tounicode to 
from/tostring now, given that array.from/tostring already has a different 
meaning in Python 3.

--

[issue8821] Range check on unicode repr

2010-12-29 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

 I think that we have good reasons to not remove the NUL character.

Please note: Nobody is suggesting that we remove the NUL character. I was 
merely suggesting that we don't rely on it where it is unnecessary.

Returning to my original patch: If the code was using the NUL character as a 
terminator, then it wouldn't be a bug.

What the repr code does is it uses the length, and does not explicitly search 
for a NUL character. However, there is a *bug* where it reads one too many 
characters in certain cases. As I said in the first place, it just happens to 
*not* be catastrophic due to the presence of the NUL character. But that does 
not mean this isn't a bug -- at the very least, the code is very confusing to 
read because it does not do what it is trying to do.

Anyway the important issue is what Marc-Andre raised about buffers. Since we 
are in agreement that there is a potential problem here, and I have a patch 
which seems correct and doesn't break any test cases (note my above post 
responding to test case breakages), can it be applied?

--

[issue8821] Range check on unicode repr

2010-08-02 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

OK, I finally had time to review this issue again.

Firstly, granted the original fix broke a test case, shouldn't we figure out 
why it broke and fix it, rather than just revert the change and continue 
relying on this tenuous assumption? Surely it's best to have as little code 
relying on it as possible.

Secondly, please have a look at my patch again. It wasn't committed properly -- 
no offense to Georg, it's an honest mistake! My patch was:

--- Objects/unicodeobject.c (revision 81539)
+++ Objects/unicodeobject.c (working copy)
@@ -3065,7 +3065,7 @@
 }
 #else
 /* Map UTF-16 surrogate pairs to '\U00xx' */
-else if (ch >= 0xD800 && ch < 0xDC00) {
+else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
 Py_UNICODE ch2;
 Py_UCS4 ucs;

The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k):
http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980&r2=83418

--- Objects/unicodeobject.c (revision 83417)
+++ Objects/unicodeobject.c (revision 83418)
@@ -3067,7 +3067,7 @@
 
 ch2 = *s++;
 size--;
-if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
+if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
 ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
 *p++ = '\\';
 *p++ = 'U';
@@ -3316,7 +3316,7 @@
 
 ch2 = *s++;
 size--;
-if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
+if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
 ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
 *p++ = '\\';
 *p++ = 'U';

I put the size check on the first character of the surrogate pair; in the 
committed version the size check was on the second character (after the size 
variable is decremented), causing it to break out of that branch too early in 
some cases.

Moving the size check to the outer if block fixes the test breakage.
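
To make the ordering issue concrete, here is a small Python model of the 
escaping loop (a toy sketch only -- the real code is C); the length check 
guards the lead-surrogate branch, before the trail surrogate is consumed:

def escape_surrogate_pairs(units):
    """Toy model: `units` is a list of UTF-16 code units (UCS2 build)."""
    out = []
    i, size = 0, len(units)
    while size > 0:
        ch = units[i]; i += 1; size -= 1
        # Check the remaining length *before* touching the trail surrogate,
        # mirroring the check on the lead-surrogate branch.
        if 0xD800 <= ch < 0xDC00 and size > 0:
            ch2 = units[i]
            if 0xDC00 <= ch2 <= 0xDFFF:
                i += 1; size -= 1
                ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000
                out.append('\\U%08x' % ucs)
                continue
        out.append('\\u%04x' % ch)
    return ''.join(out)

print(escape_surrogate_pairs([0xD800, 0xDC00]))  # \U00010000
print(escape_surrogate_pairs([0x0041, 0xD800]))  # lone trailing surrogate: \u0041\ud800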

PS. Good find on the second copy of that code in the 
PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch which 
fixes both functions instead of just the unicodeescape_string function.

Passes all test cases on UCS2 build of the 2.7 branch.

--
Added file: http://bugs.python.org/file18322/unicode-range-check2.patch

[issue1712522] urllib.quote throws exception on Unicode URL

2010-07-21 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

If you're going the way of option 2, I would strongly advise against relying on 
the KeyError. The fact that a KeyError is raised by urllib.quote is not part of 
its specification; it's a bug/quirk in the implementation (which is now 
unlikely to change, but it's unsafe to rely on it).

Robotparser should encode the string, if and only if it is a unicode string, 
with ('ascii', 'strict'), catch the UnicodeEncodeError, and raise the TypeError 
you suggested. This will have precisely the same behaviour as your proposed 
option 2 (will work fine for byte strings and Unicode strings with ASCII-only 
characters, but raise a TypeError on Unicode strings with non-ASCII characters) 
without relying on the KeyError from urllib.quote.
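
Something along these lines (the helper name is just illustrative, not actual 
robotparser code):

def _check_ascii_url(url):
    # Encode only if we were actually given a unicode object; byte strings
    # pass through untouched.
    if isinstance(url, unicode):
        try:
            url = url.encode('ascii', 'strict')
        except UnicodeEncodeError:
            raise TypeError("robotparser requires URLs to be ASCII")
    return url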

--

[issue1712522] urllib.quote throws exception on Unicode URL

2010-07-19 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

From http://mail.python.org/pipermail/python-checkins/2010-July/095350.html:
> Looking at the issue (which in itself was quite old), you could as well
> have fixed the robotparser module instead.

It isn't an issue with robotparser. The original reporter found it via 
robotparser, but it's nothing to do with that module. I found it independently 
and I would have reported it separately if it hadn't already been here.

It's definitely a bug in urllib (as shown by my extensive new test cases).

--

[issue1712522] urllib.quote throws exception on Unicode URL

2010-07-19 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

> Well, isn't it a new feature you're adding?

You had a function which raised a confusing and unintentional KeyError when 
given non-ASCII Unicode input. Now it doesn't. That's the bug fix part.

What I assume you're referring to as a new feature is the new arguments. I'd 
say they're unfortunately necessary in fixing this bug, as the fix requires 
encoding the non-ASCII unicode characters with some encoding, and it's 
(arguably) necessary to give the programmer the choice of encoding, with 
sensible defaults.

--

[issue1712522] urllib.quote throws exception on Unicode URL

2010-07-19 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

> I think everyone assumed that the parameter should be a str object
> and nothing else. Apparently some people used it accidentally with
> some unicode strings and it worked until these strings contained
> non-ASCII characters.

I don't consider use of Unicode strings in Python 2.7 to be accidental. In my 
experience with Python 2, pretty much everything already works with Unicode 
strings, and it's best practice to use them.

Now one of the major goals of Python 2.6/2.7 is to allow the writing of code 
which ports smoothly to Python 3. Unicode support is a major issue here. To 
quote "What's New in Python 3" (http://docs.python.org/py3k/whatsnew/3.0.html):
"To be prepared in Python 2.x, start using unicode for all unencoded text, and 
str for binary or encoded data only. Then the 2to3 tool will do most of the 
work for you."
Having functions in Python 2.7 which don't accept Unicode (or worse, raise 
random exceptions) runs against best practices for moving to Python 3.

> If we were following you, we would add encoding and errors arguments
> to any str-accepting 2.x function, so that it can also accept unicode
> strings. That's certainly not a reasonable solution.

No, that's certainly not necessary. You don't need an encoding or errors 
argument on any given function in order to support unicode. In fact, most code 
written to work with strings naturally works with Unicode because unicode 
strings support the same basic operations.

The need for an encoding and errors, and in fact the need to deal with 
string encoding at all with urllib.quote is due to the special nature of URLs. 
If URLs had a syntax like %u then there would be no need for encoding 
Unicode strings (as in UTF-8) at all. However, because the RFC specifies that 
Unicode strings are to be encoded into a byte sequence *using an unspecified 
encoding*, it is therefore necessary, for this specific function, to ask the 
programmer which encoding to use.

Thus I assure you, this is not just one random function I have picked to add 
these arguments to. This is the only one (that I know of) that requires them to 
support Unicode.
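
For what it's worth, Python 3 (where quote already grew these arguments) shows 
why the caller has to be able to pick the encoding:

from urllib.parse import quote

print(quote('é', encoding='utf-8'))    # %C3%A9
print(quote('é', encoding='latin-1'))  # %E9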

> The original issue is against robotparser, and clearly states a bug
> (robotparser doesn't work in some cases).

I don't know why this keeps coming back to robotparser. The original bug was 
not against robotparser; it is called "quote throws exception on Unicode URL" 
and that is the bug. Robotparser was just one demonstrative piece of code which 
failed because of it.

Having said that, I don't expect to continue this argument. If you (the Python 
developers) decide that it's too late to accept this, then I won't object to 
reverting it.

--

[issue1712522] urllib.quote throws exception on Unicode URL

2010-07-19 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

OK sure, there are some other things broken, but they are mostly not dealing 
with string data, but binary data (for example, zlib expects a sequence of 
bytes, not characters).

Just one quick point:

> >>> urllib.urlretrieve("file:///tmp/hé")
> UnicodeError: URL u'file:///tmp/h\xc3\xa9' contains non-ASCII characters

That's precisely correct behaviour. URLs are not allowed to contain non-ASCII 
characters (that's the whole point of urllib.quote). urllib.quote should accept 
non-ASCII characters (for conversion into ASCII strings). Other URL processing 
functions should not accept non-ASCII characters, since they aren't valid URIs.

--

[issue1712522] urllib.quote throws exception on Unicode URL

2010-07-18 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Thanks for doing that, Senthil.

--

[issue8987] Distutils doesn't quote Windows command lines properly

2010-06-12 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

I discovered this investigating a bug report that python-cjson doesn't compile 
properly on Windows (http://pypi.python.org/pypi/python-cjson). Cjson's 
setup.py asks distutils to compile with the flag '-DMODULE_VERSION="1.0.5"', 
but distutils.spawn._nt_quote_args is not escaping the quotes correctly.

Specifically, the current behaviour is:
>>> distutils.spawn._nt_quote_args(['-DMODULE_VERSION="1.0.5"'])
['-DMODULE_VERSION="1.0.5"']

I expect the following:
>>> distutils.spawn._nt_quote_args(['-DMODULE_VERSION="1.0.5"'])
['-DMODULE_VERSION=\\"1.0.5\\"']

Not surprising, since that function contains a big comment:
# XXX this doesn't seem very robust to me -- but if the Windows guys
# say it'll work, I guess I'll have to accept it.  (What if an arg
# contains quotes?  What other magic characters, other than spaces,
# have to be escaped?  Is there an escaping mechanism other than
# quoting?)

It only escapes spaces, and that's it. I've proposed a patch which escapes the 
following characters properly: ()^| (as far as I can tell, these are the 
reserved characters on Windows).

Note: I did not escape * or ?, the wildcard characters. As far as I can tell, 
these only have special meaning on the command-line itself, and not when 
supplied to a program.

Alternatively, it could call subprocess.list2cmdline (but there seem to be 
issues with that: http://bugs.python.org/issue8972).
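
For illustration only (this is not the attached spawn.patch, just the kind of 
quoting I mean): escape embedded double quotes and keep the existing 
whitespace-based quoting:

def quote_arg(arg):
    # Illustrative sketch; the exact metacharacter handling is debatable.
    arg = arg.replace('"', '\\"')      # escape embedded quotes
    if ' ' in arg or '\t' in arg:
        arg = '"%s"' % arg             # quote on whitespace, as before
    return arg

print(quote_arg('-DMODULE_VERSION="1.0.5"'))   # -DMODULE_VERSION=\"1.0.5\"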

--
assignee: tarek
components: Distutils
files: spawn.patch
keywords: patch
messages: 107722
nosy: mgiuca, tarek
priority: normal
severity: normal
status: open
title: Distutils doesn't quote Windows command lines properly
versions: Python 2.6
Added file: http://bugs.python.org/file17653/spawn.patch

[issue8987] Distutils doesn't quote Windows command lines properly

2010-06-12 Thread Matt Giuca

Changes by Matt Giuca matt.gi...@gmail.com:


--
type:  -> behavior

[issue8821] Range check on unicode repr

2010-05-25 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last 
character of the string is the start of a UTF-16 surrogate pair (between 
'\ud800' and '\udfff'), there is a slight overrun problem. For example:

>>> repr(u'abcd\ud800')

Upon reading ch = 0xd800, the test (ch >= 0xD800 && ch < 0xDC00) succeeds, and 
it then reads ch2 = *s++. Note that preceding this line, s points at one 
character past the end of the string, so the value read will be garbage. I 
imagine that unless it falls on a segment boundary, the worst that could happen 
is the character '\ud800' is interpreted as some other wide character. 
Nevertheless, this is bad.

Note that *technically* this is never bad, because _PyUnicode_New allocates an 
extra character and sets it to '\u0000', and thus the above example will always 
set ch2 to 0, and it will behave correctly. But this is a tenuous thing to rely 
on, especially given the comment above _PyUnicode_New:

/* We allocate one more byte to make sure the string is
   Ux0000 terminated -- XXX is this needed ?
*/

I thought about removing that XXX, but I'd rather fix the problem. Therefore, I 
have attached a patch which does a range check before reading ch2:

--- Objects/unicodeobject.c (revision 81539)
+++ Objects/unicodeobject.c (working copy)
@@ -3065,7 +3065,7 @@
 }
 #else
 /* Map UTF-16 surrogate pairs to '\U00xx' */
-else if (ch >= 0xD800 && ch < 0xDC00) {
+else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
 Py_UNICODE ch2;
 Py_UCS4 ucs;

Also affects Python 3.

--
components: Unicode
files: unicode-range-check.patch
keywords: patch
messages: 106506
nosy: mgiuca
priority: normal
severity: normal
status: open
title: Range check on unicode repr
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3
Added file: http://bugs.python.org/file17465/unicode-range-check.patch

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-15 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Thanks very much. Importantly, note that unquote is currently duplicated 
between urllib and urlparse. I have a bug on it (#8143) but in the meantime, 
you will have to commit this fix to both modules.

--

[issue8143] urlparse has a duplicate of urllib.unquote

2010-03-15 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

What about the alternative (newmodule) patch? That doesn't have threading 
issues, or break backwards compatibility.

--

[issue1712522] urllib.quote throws exception on Unicode URL

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

I've finally gotten around to a complete analysis of this code. I have a 
code/test/documentation patch which fixes the issue without any code breakage.

There is another bug in quote which I've found and fixed with this patch: If 
the 'safe' parameter is unicode, it raises a UnicodeDecodeError.

I have backported all of the 'quote' test cases from Python 3 (which I wrote) 
to Python 2. This exposed the reported bug as well as the above one. It's good 
to have a much larger set of test cases to work with. It tests things like all 
combinations of str/unicode, as well as non-ASCII byte string input and all 
manner of unicode inputs.

The bugfix itself comes from Python 3 (this has already been approved, over 
many months, by Guido, so I am hoping a similar change can get pushed through 
into Python 2 fairly easily). The solution is to add encoding and errors 
arguments to 'quote', and have quote encode the unicode string before anything 
else. 'encoding' defaults to 'utf-8'. So:

>>> quote(u'/El Niño/')
'/El%20Ni%C3%B1o/'

which is typically the desired behaviour. (Note that URI syntax does not cover 
Unicode strings; it merely says to encode them with some encoding, recommended 
(but not required) to be UTF-8, and then percent-encode those bytes.)
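
The core of the change is just "encode first, then percent-encode the bytes"; 
roughly (a sketch with a hypothetical name, not the actual patch):

from urllib import quote as _byte_quote   # the existing str-only quote

def quote_unicode(s, safe='/', encoding='utf-8', errors='strict'):
    # Illustration only: the patch folds this into quote() itself.
    if isinstance(s, unicode):
        s = s.encode(encoding, errors)
    return _byte_quote(s, safe)

print(quote_unicode(u'/El Ni\xf1o/'))   # /El%20Ni%C3%B1o/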

With this patch, quote *always* returns a str, even on unicode input. I think 
that makes sense, because a URI is, by definition, an ASCII string. It could 
easily be made to return a unicode instead.

The other fix is for 'safe'. If 'safe' is a byte string we don't touch it. But 
if it is a Unicode string, we throw away all non-ASCII bytes. This means you 
can't make *characters* safe, only *bytes*, since URIs deal with bytes. In 
Python 3, we go further and throw away all non-ASCII bytes from 'safe' as well, 
so you can only make ASCII bytes safe. For this patch, I didn't go that far, 
for backwards compatibility reasons.

Also updated documentation.

In summary, this patch makes 'quote' fully Unicode compliant. It does not 
change any existing behaviour which wouldn't previously have thrown an 
exception, so it can't possibly break any existing code (unless it's relying on 
the exception being thrown).

(A minor change I made was replacing the line cachekey = (safe, always_safe) 
with cachekey = safe. This avoids unnecessary work of hashing always_safe and 
the tuple, since always_safe doesn't change. It doesn't affect the behaviour.)

Note: I've also backported the 'unquote' test cases from Python 3 and found a 
few more bugs. I'm going to report them separately, with patches.

--
keywords: +patch
Added file: http://bugs.python.org/file16539/urllib-quote.patch

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-14 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

urllib.unquote fails to decode a percent-escape with mixed case. To demonstrate:

>>> unquote("%fc")
'\xfc'
>>> unquote("%FC")
'\xfc'
>>> unquote("%Fc")
'%Fc'
>>> unquote("%fC")
'%fC'

Expected behaviour:

>>> unquote("%Fc")
'\xfc'
>>> unquote("%fC")
'\xfc'

I actually fixed this bug in Python 3, at Guido's request as part of the huge 
fix to issue 3300. To quote Guido:

>  # Maps lowercase and uppercase variants (but not mixed case).
> That sounds like a disaster.  Why would %aa and %AA be correct but
> not %aA and %Aa?  (Even though the old code had the same problem.)

(Indeed, RFC 3986 allows mixed-case percent escapes.)

I have attached a patch which fixes it simply by removing the dict mapping all 
lower and uppercase variants to characters, and simply calling int(item[:2], 
16). It's slower, but correct. This is the same solution we used in Python 3.
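
The replacement logic is essentially (a sketch of the idea, not the patch 
text):

def unquote_sketch(s):
    parts = s.split('%')
    res = [parts[0]]
    for item in parts[1:]:
        try:
            res.append(chr(int(item[:2], 16)) + item[2:])
        except ValueError:
            res.append('%' + item)    # leave malformed escapes alone
    return ''.join(res)

print(repr(unquote_sketch('%Fc')))    # '\xfc'
print(repr(unquote_sketch('%fC')))    # '\xfc'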

I've also backported a number of test cases from Python 3 which cover this 
issue, and also legitimate bad percent encoding.

Note: I've also backported the remainder of the 'unquote' test cases from 
Python 3 but I found another bug, so I will report that separately, with a 
patch.

--
components: Library (Lib)
files: urllib-unquote-mixcase.patch
keywords: patch
messages: 101044
nosy: mgiuca
severity: normal
status: open
title: urllib.unquote doesn't decode mixed-case percent escapes
type: behavior
versions: Python 2.6, Python 2.7
Added file: http://bugs.python.org/file16540/urllib-unquote-mixcase.patch

[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

The 'unquote' function has some very strange behaviour on Unicode input. My 
proposed fix will, I am sure, be contentious, because it could change existing 
behaviour (only on unicode strings), but I think it's worth it for a sane 
unquote function.

Some historical context: I already reported this bug in Python 3 as 
http://bugs.python.org/issue3300 (or part of it). We argued for two months. I 
rewrote the function, and Guido accepted it. The same bugs are present in 
Python 2, but less urgent since they only affect 'unicode' strings (whereas in 
Python 3 they affected all strings).

PROBLEM DESCRIPTION

The basic problem is this:
Current behaviour:
>>> urllib.unquote(u"%CE%A3")
u'\xce\xa3'
(or u'Σ')

Desired behaviour:
>>> urllib.unquote(u"%CE%A3")
'\xce\xa3'
(Which decodes with UTF-8 to u'Σ')

Basically, if you give unquote a unicode string, it will give you back a 
unicode string, with all of the percent-escapes decoded with Latin-1. This 
behaviour was added in r39728. The following line was added:

res[i] = unichr(int(item[:2], 16)) + item[2:]

It takes a percent-escape (e.g., CE), converts it to an int (e.g., 0xCE), 
then calls unichr to form a Unicode character with that codepoint (e.g., 
u'\u00ce'). That's totally wrong. A URI percent-escape is used to represent a 
data octet [RFC 3986], not a Unicode code point.

I would argue that the result of unquote should always be a str, no matter the 
input. Since a URI represents a byte sequence, not a character sequence, 
unquote of a unicode should return a byte string, which the user can then 
decode as desired.

Note that in Python 3 we didn't have a choice, since all strings are unicode, 
we used a much more complicated solution. But we also added a function 
unquote_to_bytes. Python 2's unquote should behave like Python 3's 
unquote_to_bytes.

PROPOSED SOLUTION

To that end, my proposed solution is simply to encode the input unicode string 
with UTF-8, which is exactly what Python 3's unquote_to_bytes function does. I 
have attached a patch which does this. It is thoroughly tested and documented. 
However, see the discussion of potential problems later.
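
In other words, the proposed semantics boil down to this (hypothetical wrapper 
shown only to illustrate; the patch changes unquote itself):

import urllib

def unquote_as_bytes(s):
    if isinstance(s, unicode):
        s = s.encode('utf-8')        # URIs represent octets, so go via UTF-8
    return urllib.unquote(s)         # always returns a str (byte string)

print(repr(unquote_as_bytes(u'%CE%A3')))   # '\xce\xa3'
print(repr(unquote_as_bytes('%CE%A3')))    # '\xce\xa3' (unchanged for str input)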

WHY THE CURRENT BEHAVIOUR IS BAD

I'll also point out that the patch in r39728 which created this problem also 
added a test case, still there, which demonstrates just how confusing this 
behaviour is:

r = urllib.unquote(u'br%C3%BCckner_sapporo_20050930.doc')
self.assertEqual(r, u'br\xc3\xbcckner_sapporo_20050930.doc')

This takes a string, clearly meant to be a UTF-8-encoded percent-escaped string 
for u'brückner_sapporo_20050930.doc', and unquotes it. Because of this bug, it 
is decoded with Latin-1 to the mojibake string u'br\xc3\xbcckner_sapporo_20050930.doc' 
(that is, u'brÃ¼ckner_sapporo_20050930.doc'). And this garbled string is 
*actually the expected output of the test case*!!

Furthermore, this behaviour is very confusing because it breaks equality of 
ASCII str and unicode objects. Consider:

>>> "%C3%BC" == u"%C3%BC"
True
>>> urllib.unquote("%C3%BC")
'\xc3\xbc'
>>> urllib.unquote(u"%C3%BC")
u'\xc3\xbc'
>>> urllib.unquote("%C3%BC") == urllib.unquote(u"%C3%BC")
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
False

Why should the ASCII str object "%C3%BC" encode to one value, while the ASCII 
unicode object u"%C3%BC" encode to another? The two inputs represent the same 
string, so they should produce the same output.

POTENTIAL PROBLEMS

The attached patch will not, to my knowledge, affect any calls to unquote with 
str input. It only changes unicode input. Since this was buggy anyway, I think 
it's a legitimate fix.

I am, however, concerned about code breakage for existing code which uses 
unicode strings and depends upon this behaviour. Some use cases:

1. Unquoting a unicode string which is pure ASCII, with pure ASCII 
percent-escapes. This previously would produce a pure ASCII unicode string, now 
produces a pure ASCII str. This shouldn't break anything unless some code 
specifically checks that strings are of type 'unicode' (e.g., the Storm 
database library).
2. Unquoting a unicode string with pure ASCII percent-escapes, but non-ASCII 
characters. This previously would preserve all the unescaped characters; they 
will now be encoded to UTF-8. Technically this should never happen, as URIs are 
not allowed to contain non-ASCII characters [RFC 3986].
3. Unquoting a unicode string which is pure ASCII, with non-ASCII percent 
escapes. Some code may rely on the implicit decoding as Latin-1. However, I 
think it is more likely that existing code would just break, since most URIs 
are UTF-8 encoded.

TWO SOLUTIONS

Having gone through the problems, I imagine that we may reach the conclusion 
that it is too dangerous to fix this bug. Therefore, I am proposing an 
alternate solution (which I will attach in a follow-up comment), which is not 
to change the code at all. Instead, just fix the broken test case and add lots 
more test cases, and also

[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Alternative patch which fixes test cases and documentation without changing the 
behaviour.

--
Added file: http://bugs.python.org/file16542/urllib-unquote-explain.patch

[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Changes by Matt Giuca matt.gi...@gmail.com:


Removed file: http://bugs.python.org/file16542/urllib-unquote-explain.patch

[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

New version of explain patch -- fixed comment linking to the wrong bug ID -- 
now links to this bug ID (#8136).

--
Added file: http://bugs.python.org/file16545/urllib-unquote-explain.patch

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

> Note: I've also backported the remainder of the 'unquote' test cases
> from Python 3 but I found another bug, so I will report that separately,
> with a patch.

Filed under issue #8136.

--

[issue8143] urlparse has a duplicate of urllib.unquote

2010-03-14 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

urlparse contains a complete copy of the urllib.unquote function. This is 
extremely nasty code duplication -- I have two patches pending on 
urllib.unquote (#8135 and #8136) and I only just realised that I missed 
urlparse.unquote!

The reason given for this is:
"Cannot use directly from urllib as it would create circular reference.
urllib uses urlparse methods (urljoin)"

I don't see that as a reason for code duplication. The fix is to make a local 
import of unquote in parse_qsl, like this:

def parse_qsl(qs, keep_blank_values=0, strict_parsing=0):
    from urllib import unquote

I am aware that this possibly violates PEP 8 (all imports should be at the top 
of the module), but I'd say this is the lesser of two evils.

A patch is attached. Commit log: urlparse: Removed duplicate of 
urllib.unquote. Replaced with a local import.

--
components: Library (Lib)
files: urlparse-unquote.patch
keywords: patch
messages: 101075
nosy: mgiuca
severity: normal
status: open
title: urlparse has a duplicate of urllib.unquote
versions: Python 2.6, Python 2.7
Added file: http://bugs.python.org/file16550/urlparse-unquote.patch

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Oh, I just discovered that urlparse contains a copy of unquote, which will also 
need to be patched. I've submitted a patch to remove the duplicate (#8143) -- 
if that is accepted first then there's no need to worry about it.

--

[issue8136] urllib.unquote decodes percent-escapes with Latin-1

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Oh, I just discovered that urlparse contains a copy of unquote, which will also 
need to be patched. I've submitted a patch to remove the duplicate (#8143) -- 
if that is accepted first then there's no need to worry about it.

--

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

I thought more about it, and wrote a different patch which doesn't remove the 
dictionary. I just replaced the dictionary creation code -- now it includes 
keys for all combinations of upper and lower case (for two-letter hex codes). 
This dictionary isn't much bigger -- 484 entries where it previously had 412.
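
Concretely, the table construction is along these lines (illustrative, not the 
exact patch text); 22 hex characters in each position gives the 484 entries:

_hexdig = '0123456789ABCDEFabcdef'
_hextochr = dict((a + b, chr(int(a + b, 16)))
                 for a in _hexdig for b in _hexdig)

print(len(_hextochr))          # 484
print(repr(_hextochr['Fc']))   # '\xfc'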

Therefore, here is a replacement patch (urllib-unquote-mixcase.patch2).

--
Added file: http://bugs.python.org/file16551/urllib-unquote-mixcase.patch2

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-14 Thread Matt Giuca

Changes by Matt Giuca matt.gi...@gmail.com:


Removed file: http://bugs.python.org/file16551/urllib-unquote-mixcase.patch2

[issue8135] urllib.unquote doesn't decode mixed-case percent escapes

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Tiny fix to patch2 -- replaced list comprehension with generator expression in 
dictionary construction.

--
Added file: http://bugs.python.org/file16552/urllib-unquote-mixcase.patch2

[issue8143] urlparse has a duplicate of urllib.unquote

2010-03-14 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

If this patch is rejected, then at the very least, the urllib.unquote function 
needs a comment at the top explaining that it is duplicated in urlparse, so any 
changes should be made to both.

Note that urlparse.unquote is not a documented function, or in the __all__ 
export list, so people *shouldn't* be using it. But OK, I'll accept that some 
might.

If there is a problem with some kind of race condition importing (I don't see 
how there could be, but I'll accept it if someone confirms), or with people 
using urlparse.unquote directly, then I'd propose an alternate solution which 
removes the circular dependency entirely: Move unquote into a separate module 
_urlunquote, which is imported by both urllib and urlparse. No code breakage.

Patch attached. Commit log: Fixed duplication of urllib.unquote in urlparse. 
Moved function to a separate module _urlunquote.

--
Added file: http://bugs.python.org/file16553/urlparse-unquote-newmodule.patch

[issue5827] os.path.normpath doesn't preserve unicode

2009-11-18 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Thanks Ezio.

I've updated the patch to incorporate your suggestions.

Note that I too have only tested it on Linux, but I tested both
posixpath and ntpath (and there is no OS-specific code, except for the
filenames themselves).

I'm not sure if using assertTrue(isinstance ...) is better than
assertEqual(type ...), because the type equality checking produces this
error:
AssertionError: <type 'str'> != <type 'unicode'>
while isinstance produces this unhelpful error:
AssertionError: False is not True

But oh well, I made the change anyway as most test cases use isinstance.

--
Added file: http://bugs.python.org/file15362/normpath.2.patch

[issue1712522] urllib.quote throws exception on Unicode URL

2009-05-26 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

The issue of urllib.quote was discussed at extreme length in issue 3300,
which was specific to Python 3.
http://bugs.python.org/issue3300

In the end, I rewrote the entire family of urllib.quote and unquote
functions; they're now Unicode compliant and accept additional encoding
and errors arguments to handle this.

They were never backported to the 2.x branch; maybe we should do so.
Note that the code is quite different and considerably more complex due
to all the new issues with Unicode strings.

--
nosy: +mgiuca

[issue6118] urllib.parse.quote_plus ignores optional arguments

2009-05-26 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

urllib.parse.quote_plus will ignore its encoding and errors arguments if
its input string has a space in it.

Intended behaviour:
>>> urllib.parse.quote_plus("\xa2\xd8 \xff", encoding='latin-1')
'%A2%D8+%FF'
Observed behaviour:
>>> urllib.parse.quote_plus("\xa2\xd8 \xff", encoding='latin-1')
'%C2%A2%C3%98+%C3%BF'
(This just uses the default UTF-8 encoding).

Attached patch with test cases. This only affects Python 3.x (the 2.x
branch has no encoding/errors argument).
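
The intended fix is simply to forward encoding/errors to quote() instead of 
dropping them; a sketch of the behaviour (not the exact patch):

from urllib.parse import quote

def quote_plus_sketch(s, safe='', encoding=None, errors=None):
    if ' ' not in s:
        return quote(s, safe, encoding, errors)
    # Treat the space as safe while quoting, then turn it into '+'.
    s = quote(s, safe + ' ', encoding, errors)
    return s.replace(' ', '+')

print(quote_plus_sketch("\xa2\xd8 \xff", encoding='latin-1'))   # %A2%D8+%FF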

--
components: Library (Lib)
files: urllib_quote_plus.patch
keywords: patch
messages: 88368
nosy: mgiuca
severity: normal
status: open
title: urllib.parse.quote_plus ignores optional arguments
type: behavior
versions: Python 3.0, Python 3.1, Python 3.2
Added file: http://bugs.python.org/file14081/urllib_quote_plus.patch

[issue3613] base64.encodestring does not actually accept strings

2009-04-23 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

I've attached a patch which renames encodestring to encodebytes (keeping
encodestring around as an alias). Updated test and documentation.

I also renamed decodestring to decodebytes, because it also refuses to
accept a string (only a bytes). I have an alternative suggestion, which
I'll post in a separate comment (in a minute).

--
Added file: http://bugs.python.org/file13753/encodestring_rename.patch

[issue3613] base64.encodestring does not actually accept strings

2009-04-23 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Now, base64.encodestring and decodestring seem a bit weird because the
Base64 encoded string is also required to be a bytes.

It seems to me that once something is Base64-encoded, it's considered to
be ASCII text, not just some byte string, and therefore it should be a
str, not a bytes. (For example, they end with a '\n'. That's something
which strings do, not bytes).

Hence, base64.encodestring (which should be encodebytes) should take a
bytes and return a str. base64.decodestring should take a str (required
to be ASCII-only) and return a bytes.
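
In terms of types, the pair I have in mind looks roughly like this (a sketch 
built on binascii, not the submitted patch, and ignoring the line wrapping the 
real functions do):

import binascii

def encodebytes(data):
    # bytes in -> ASCII str out: once encoded, the result is text.
    return binascii.b2a_base64(data).decode('ascii')

def decodebytes(text):
    # ASCII str in -> bytes out.
    return binascii.a2b_base64(text.encode('ascii'))

print(encodebytes(b'hello'))       # aGVsbG8= (plus a trailing newline)
print(decodebytes('aGVsbG8=\n'))   # b'hello'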

I've attached an alternative patch, encodebytes_new_types.patch (which,
unlike my other patch, doesn't rename decodestring to decodebytes). This
patch:

- Renames encodestring to encodebytes.
- Changes the output of encodebytes to return an ASCII str*, not a bytes.
- Changes the input of decodestring to accept an ASCII str, not a bytes.

* An ASCII str is a Unicode string with only ASCII characters.

This isn't a proper patch (it breaks a lot of other code which I haven't
bothered to fix). I'm just submitting it as an idea, in case this is
something we want to do. Most likely not, due to the breakage. Also we
have the same problem for the non-legacy functions, b64encode and
b64decode, etc, so the problem is more widespread than just these two
functions.

--
Added file: http://bugs.python.org/file13754/encodebytes_new_types.patch

[issue3565] array documentation, method names not 3.0 compliant

2009-04-23 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

OK since the patches I submitted are now eight months old, I just did an
update and re-applied them. I am submitting new patch files which don't
change anything, but are patches against revision 71822 (should be much
easier to apply).

I'd still like to see doc+bytesmethods.patch applied (since it fixes
method names which make no sense at all in Python 3.0 context), but
since it's getting a bit late for that, I'll be happy for the doc-only
patch to be accepted (which merely corrects the documentation which is
still using Python 2.x terminology).

--
Added file: http://bugs.python.org/file13755/doc-only.patch

[issue3565] array documentation, method names not 3.0 compliant

2009-04-23 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Full method renaming patch.

--
Added file: http://bugs.python.org/file13756/doc+bytesmethods.patch

[issue5827] os.path.normpath doesn't preserve unicode

2009-04-23 Thread Matt Giuca

New submission from Matt Giuca matt.gi...@gmail.com:

In the Python 2.x branch, os.path.normpath will sometimes return a str
even if given a unicode. This is not an issue in the Python 3.0 branch.

This happens specifically when it throws away all string data and
constructs its own:

>>> os.path.normpath(u'')
'.'
>>> os.path.normpath(u'.')
'.'
>>> os.path.normpath(u'/')
'/'

This is a problem if working with code which expects all strings to be
unicode strings (sometimes, functions raise exceptions if given a str,
when expecting a unicode).

I have attached patches (with test cases) for posixpath and ntpath which
correctly preserve the unicode-ness of the input string, such that the
new behaviour is:

>>> os.path.normpath(u'')
u'.'
>>> os.path.normpath(u'.')
u'.'
>>> os.path.normpath(u'/')
u'/'

I tried it on os2emxpath and plat-riscos/riscospath (the other two
OS-specific path modules), and it already worked fine for them.
Therefore, this patch fixes all necessary OS-specific versions of os.path.
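
The essence of the fix is to pick the literal components based on the input 
type instead of hard-coding str constants; a minimal sketch (not necessarily 
identical to the attached patch):

def _path_constants(path):
    # Return ('.', '/', '', '..') in the same string type as `path`.
    if isinstance(path, unicode):
        return u'.', u'/', u'', u'..'
    return '.', '/', '', '..'

dot, sep, empty, pardir = _path_constants(u'/spam/../')
print(repr(dot))   # u'.'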

--
components: Library (Lib), Unicode
files: normpath.patch
keywords: patch
messages: 86395
nosy: mgiuca
severity: normal
status: open
title: os.path.normpath doesn't preserve unicode
versions: Python 2.6, Python 2.7
Added file: http://bugs.python.org/file13757/normpath.patch

[issue5827] os.path.normpath doesn't preserve unicode

2009-04-23 Thread Matt Giuca

Changes by Matt Giuca matt.gi...@gmail.com:


--
type:  -> behavior

[issue3565] array documentation, method names not 3.0 compliant

2009-04-23 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

I agree with that -- too big a change to make now.

But can we please get the documentation patch accepted? It's been
waiting here for eight months with corrections to clearly-incorrect
documentation.

--

[issue3565] array documentation, method names not 3.0 compliant

2009-03-17 Thread Matt Giuca

Matt Giuca matt.gi...@gmail.com added the comment:

Note that, irrespective of the changes to the library itself, the
documentation is out of date since it still uses the old
string/unicode nomenclature, rather than the new bytes/string. I
have provided a separate documentation patch which should be applicable
with relatively little fuss.

(It's from August so it will probably conflict, but I can update it if
necessary).

--

[issue3803] Comparison operators - New rules undocumented in Python 3.0

2008-09-08 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

I've noticed that in Python 3.0, the <, >, <= and >= operators now raise
a TypeError when comparing objects of different types, rather than
ordering them consistently but arbitrarily. The documentation doesn't
yet reflect this behaviour.

The current documentation says:
(This unusual definition of comparison was used to simplify the
definition of operations like sorting and the in and not in operators.
In the future, the comparison rules for objects of different types are
likely to change.)

I assume this is the change it's warning us about.

Hence I propose this patch to reference/expressions.rst, which removes
the above quoted paragraph and describes the new TypeError-raising
behaviour. My text is as follows:

The objects need not have the same type. If both are numbers, they are
converted to a common type. Otherwise, the == and != operators always
consider objects of different types to be unequal, while the <, >, <=
and >= operators raise a TypeError when comparing objects of different
types.
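
A quick illustration of the behaviour the new text describes (Python 3.0; 
Python 2 would instead order the operands arbitrarily but consistently):

print(1 == 'a')          # False: mixed-type == is simply unequal
try:
    1 < 'a'
except TypeError as e:
    print('TypeError:', e)   # e.g. unorderable types: int() < str()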

--
assignee: georg.brandl
components: Documentation
files: expressions.patch
keywords: patch
messages: 72767
nosy: georg.brandl, mgiuca
severity: normal
status: open
title: Comparison operators - New rules undocumented in Python 3.0
versions: Python 3.0
Added file: http://bugs.python.org/file11421/expressions.patch

[issue3793] Small RST fix in datamodel.rst

2008-09-06 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

A missing blank line under the heading for __bool__ in datamodel.rst (in
Python 3.0 docs) causes the following line to appear in the output HTML.

.. index:: single: __len__() (mapping object method)

Visible here:
http://docs.python.org/dev/3.0/reference/datamodel.html#object.__bool__

Fixed in attached patch by adding a blank line.

Commit log:
Added blank line to avoid RST source leaking into HTML output.

--
assignee: georg.brandl
components: Documentation
files: patch
messages: 72668
nosy: georg.brandl, mgiuca
severity: normal
status: open
title: Small RST fix in datamodel.rst
versions: Python 3.0
Added file: http://bugs.python.org/file11405/patch

[issue3565] array documentation, method names not 3.0 compliant

2008-09-03 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Can I just remind people that I have a documentation patch ready here
(and has been for about a month)?

Of course the doc+bytesmethods.patch may be debatable and probably too
late to go in 3.0. But you should be able to commit doc-only.patch with
no problems.

Current array documentation
(http://docs.python.org/dev/3.0/library/array.html) is clearly wrong in
Python 3.0 (even containing syntax errors).

[issue600362] relocate cgi.parse_qs() into urlparse

2008-08-25 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

It seems like parse_multipart and parse_header are very strongly related
to parse_qs (e.g., if you want to process HTTP requests you'll want to
call parse_qs for x-www-form-urlencoded and parse_multipart for
multipart/form-data).

Should these be moved too? (They aren't part of the url syntax though,
so it doesn't make sense for them to be in urlparse).

--
nosy: +mgiuca

[issue3613] base64.encodestring does not actually accept strings

2008-08-20 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Hi Dmitry,

RE the method behaviour: I think it probably is correct to NOT accept a
string. Given that it's base64 encoding it, it only makes sense to
encode bytes, not arbitrary Unicode characters which have no
well-defined binary representation.

RE the method name: I agree, it should be renamed to encodestring. I
argued a similar case for the array.tostring and fromstring methods
(which actually act on bytes in Python 3.0) - here:
http://bugs.python.org/issue3565. So far nobody replied on that issue; I
think it may be too late to rename them. Best we can do is document them.

RE xmlrpc.client:1168. We just checked in a patch to urllib which adds
an unquote_to_bytes function (see
http://docs.python.org/dev/3.0/library/urllib.parse.html#urllib.parse.unquote_to_bytes).
(Unquote itself still returns a string). It should be correct to just
change xmlrpc.client:1168 to call urllib.parse.unquote_to_bytes. (Though
I've not tested it).

--
nosy: +mgiuca

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-20 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Thanks for pointing that out, Antoine. I just commented on that bug.

[issue3613] base64.encodestring does not actually accept strings

2008-08-20 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

> > it should be renamed to encodestring
> Huh ? It is already called that :)

Um ... yes. I mean encodebytes :)

> > Best we can do is document them.
> Oh well.

But I don't know the rules. People are saying things like "no new
features after beta3" but I take it that
backwards-compatibility-breaking changes are included in this.

But maybe it's still OK for us to break code after the beta. Perhaps
someone involved in the release can comment on this issue (and hopefully
with a view to my array patch - http://bugs.python.org/issue3565 - as well).

[issue3565] array documentation, method names not 3.0 compliant

2008-08-20 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

A similar issue came up in another bug
(http://bugs.python.org/issue3613), and Guido said:

"IMO it's okay to add encodebytes(), but let's leave encodestring()
around with a deprecation warning, since it's so late in the release cycle."

I think that's probably wise RE this bug as well - my original
suggestion to REPLACE tostring/fromstring with tobytes/frombytes was
probably a bit over-zealous.

I'll have another go at this during some spare cycles tomorrow -
basically taking my current patch and adding tostring/fromstring back
in, to call tobytes/frombytes with deprecation warnings. Does this sound
like a good plan?
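
Concretely, the old names would stay behind as thin deprecated wrappers, 
something like this (Python-level sketch only; the real array module is C):

import warnings

class array_sketch(object):
    def tobytes(self):
        return b'...'   # placeholder for the real conversion

    def tostring(self):
        warnings.warn("tostring() is deprecated; use tobytes()",
                      DeprecationWarning, stacklevel=2)
        return self.tobytes()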

(Also policy question: When you have deprecated functions, how do you
document them? I assume you say deprecated in the docs; is there a
standard template for this?)

[issue3609] does parse_header really belong in CGI module?

2008-08-19 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

These functions are for generic MIME headers and bodies, so are
applicable to CGI, HTTP, Email, and any other protocols based on MIME.
So I think having them in email.header makes about as much sense as
having them in cgi.

Isn't mimetools a better package for this?

Also I think there's an exodus of functions from cgi -- there's talk
about parse_qs/parse_qsl being moved to urllib (I thought that was
almost finalised). Traditionally the cgi module has had way too much
stuff in it which only superficially applies to cgi.

I'm also thinking of cgi.escape, which I'd rather see in htmllib than
cgi (except that htmllib is described as "A parser for HTML documents").

But I'm worried that these functions are too ingrained in people's
memories (I type cgi.escape several times a day and I'd get confused
if it moved). So perhaps these moves are too late.

I imagine if they were moved (at least for a few versions) the old ones
would still work, with a deprecation warning?

--
nosy: +mgiuca

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-18 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Hi,

Sorry to bump this, but you (Guido) said you wanted this closed by
Wednesday. Is this patch committable yet? (There are no more unresolved
issues that I am aware of).

[issue3576] Demo/embed builds against old version

2008-08-17 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

The Python 2.6 version of Demo/embed/Makefile builds against
libpython2.5.a, which doesn't exist in this version.

Quick patch to let it build against libpython2.6.a.

Commit log:

Fixed Demo/embed/Makefile to build against libpython2.6.a.

--
components: Build
files: embed.makefile.patch
keywords: patch
messages: 71264
nosy: mgiuca
severity: normal
status: open
title: Demo/embed builds against old version
type: compile error
versions: Python 2.6
Added file: http://bugs.python.org/file11136/embed.makefile.patch




[issue3565] array documentation, method names not 3.0 compliant

2008-08-16 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

A few weeks ago I fixed the struct module's documentation which wasn't
3.0 compliant (basically renaming strings to bytes and unicode to
string). Now I've had a look at the array module, and it's got similar
problems.

http://docs.python.org/dev/3.0/library/array.html

Unfortunately, the method names are wrong as far as Py3K is concerned.
tostring returns what is now called a bytes, and tounicode returns
what is now called a string.

There are a few other errors in the documentation too, like the 'c' type
code (which no longer exists, but is still documented), and examples
using Python 2 syntax. Those are trivial to fix.

I suggest a 3-step process for fixing this:
1. Update the documentation to describe the 3.0 behaviour using 3.0
terminology, even though the method names are wrong (I've done this
already).
2. Rename tostring and fromstring methods to tobytes and
frombytes. I think this is quite important as the value being returned
can no longer be described as a string.
3. Rename tounicode and fromunicode methods to tostring and
fromstring. I think this is less important, as the name unicode
isn't ambiguous, and potentially undesirable, as we'd be re-using method
names which previously did something else.

I'm aware we've got the final beta in 4 days, and there's no way my
phase 2-3 can be done after that. I think we should aim to do phase 2,
but probably not phase 3.

I've fixed the documentation to accurately describe the current
behaviour, using Python 3 terminology. This doesn't change any behaviour
at all, so it should be able to be committed immediately.

I'll have a go at a phase 2 patch shortly. Is it feasible to even
think about renaming a method at this stage?

Commit log:

Doc/library/array.rst, Modules/arrayobject.c:

Updated array module documentation to be Python 3.0 compliant.

* Removed references to 'c' type code (no longer valid).
* References to string changed to bytes.
* References to unicode changed to string.
* Updated examples to use Python 3.0 syntax (and show the output of
evaluating them).

--
assignee: georg.brandl
components: Documentation, Interpreter Core
files: doc-only.patch
keywords: patch
messages: 71201
nosy: georg.brandl, mgiuca
severity: normal
status: open
title: array documentation, method names not 3.0 compliant
versions: Python 3.0
Added file: http://bugs.python.org/file11121/doc-only.patch




[issue3565] array documentation, method names not 3.0 compliant

2008-08-16 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 I'm not a native speaker (of English), but my understanding is that the
 noun string, in itself, can very well be used to describe this type:
 the result is a byte string, as opposed to a character string.
 Merriam-Webster's seems to agree; meaning 5b(2) is a sequence of like
 items (as bits, characters, or words)

Ah yes, that's quite right (and computer science literature will
strongly support that claim as well).

However the word string, unqualified, and in Python 3.0 terminology
(as described in PEP 358) now refers only to the str type (formerly
known as unicode), so it is very confusing to have a method tostring
which returns a bytes object.

For array to become a good Py3k citizen, I'd strongly argue that
tostring/fromstring should be renamed to tobytes/frombytes. I'm
currently writing a patch for that - it looks like there's very minimal
damage.

However as a separate issue, I think the documentation update should be
approved first.




[issue3565] array documentation, method names not 3.0 compliant

2008-08-16 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

(Fixed issue title)

--
title: array documentation, method names not 3.0 compliant - array 
documentation, method names not 3.0 compliant




[issue3565] array documentation, method names not 3.0 compliant

2008-08-16 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

I renamed tostring/fromstring to tobytes/frombytes in the array module,
as described above. I then grepped the entire py3k tree for tostring
and fromstring, and carefully replaced all references which pertain to
array objects.

The relatively minor number of these references suggests this won't be a
big problem. All the test cases pass.

I haven't (yet) renamed tounicode/fromunicode to tostring/fromstring.
The more I think about it, the more that sounds like a bad idea (and
could create confusion as to whether this is a character string or byte
string, as Martin pointed out).

The patch (doc+bytesmethods.patch) does both the original
doc-only.patch, plus the renaming and updating of all usages. Use the
above commit log, plus:

Renamed array.tostring to array.tobytes, and array.fromstring to
array.frombytes, to reflect the Python 3.0 terminology.

Updated all references to these methods in Lib to the new names.

Added file: http://bugs.python.org/file11122/doc+bytesmethods.patch




[issue3565] array documentation, method names not 3.0 compliant

2008-08-16 Thread Matt Giuca

Changes by Matt Giuca [EMAIL PROTECTED]:


Removed file: http://bugs.python.org/file11122/doc+bytesmethods.patch




[issue3565] array documentation, method names not 3.0 compliant

2008-08-16 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Oops .. forgot to update the array.rst docs with the new method names.
Replaced doc+bytesmethods.patch with a fixed version.

Added file: http://bugs.python.org/file11123/doc+bytesmethods.patch




[issue3564] making partial functions comparable

2008-08-16 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

It's highly debatable whether these should compare true. (Note: saying
they aren't comparable is a misnomer -- they are comparable, they just
don't compare equal).

From a mathematical standpoint, they *are* equal, but it's impossible
(undecidable) to get pure equality of higher order values (functions) in
a programming language (because strictly, any two functions which give
the same results for the same input are equal, but it's undecidable
whether any two functions will give the same results for all inputs). So
we have to be conservative (false negatives, but no false positives).

In other words, should these compare equal?

>>> (lambda x: x + 1) == (lambda x: x + 1)
False (even though technically they describe the same function)

I would argue that if you call functools.partial twice, separately, then
you are creating two function objects which are not equal.

I would also argue that functools.partial(f, arg1, ..., argn) should be
equivalent to lambda *rest: f(arg1, ..., argn, *rest). Hence your example:
>>> def foo(): pass
>>> f1=functools.partial(foo)
>>> f2=functools.partial(foo)

Is basically equivalent to doing this:

>>> def foo(): pass
>>> f1 = lambda: foo()
>>> f2 = lambda: foo()

Now f1 and f2 are not equal, because they are two separately defined
functions.

I think you should raise this on Python-ideas, instead of as a bug report:
http://mail.python.org/pipermail/python-ideas/

But with two identical functions comparing INEQUAL if they were created
separately, I see no reason for partial functions to behave differently.
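
To make that concrete (my own sketch, not anything from a patch;
partials_equal is a hypothetical helper, not a proposed API):

import functools

def foo(): pass

f1 = functools.partial(foo)
f2 = functools.partial(foo)
print(f1 == f2)   # False: two distinct objects, compared by identity

def partials_equal(p, q):
    # Compare the wrapped callable and the stored arguments explicitly.
    return (p.func, p.args, p.keywords) == (q.func, q.args, q.keywords)

print(partials_equal(f1, f2))   # True

So anyone who really wants structural equality can spell it out explicitly;
building it into partial itself is the part I'd argue against.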

--
nosy: +mgiuca




[issue3547] Ctypes is confused by bitfields of varying integer types

2008-08-15 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Confirmed in HEAD for Python 2.6 and 3.0, on Linux.

Python 2.6b2+ (trunk:65708, Aug 16 2008, 15:04:13) 
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2

Python 3.0b2+ (py3k:65708, Aug 16 2008, 15:09:19) 
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2

I was also able to simplify the test case. I get this issue just using
one c_short and one c_long, with nonstandard bit lengths. eg:

fields = [('a', c_short, 16), ('b', c_long, 16)]

(sizeof(c_short) == 2 and sizeof(c_long) == 4).

But it's somewhat sporadic under which conditions it happens and which
it doesn't.

One might imagine this was a simple calculation. But the _ctypes module
is so big (5000 lines of C); at an initial glance I can't find the code
responsible! Any hints? (Modules/_ctypes/ctypes.c presumably is where
this takes place).
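
For reference, a minimal script along the lines of the fields above (my own
sketch, not the original reporter's code) that exercises the same layout:

from ctypes import Structure, c_short, c_long, sizeof

class BITS(Structure):
    _fields_ = [('a', c_short, 16), ('b', c_long, 16)]

x = BITS()
x.a = -1
x.b = 0x1234
# On an affected build the reported size and/or the stored values come out wrong.
print(sizeof(BITS), x.a, hex(x.b))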

--
nosy: +mgiuca
versions: +Python 2.6, Python 3.0




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Ah cheers Antoine, for the tip on using defaultdict (I was confused as
to how I could access the key just by passing defaultfactory, as the
manual suggests).
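
One way to get at the key is to subclass defaultdict and override __missing__
(which does see the key) rather than passing a default_factory. A rough sketch
of the idea (illustration only, not the patch itself; the safe set here is
just the RFC 3986 unreserved characters):

from collections import defaultdict
from string import ascii_letters, digits

_ALWAYS_SAFE = frozenset((ascii_letters + digits + '_.-~').encode('ascii'))

class Quoter(defaultdict):
    # Maps a byte value (an int in range(256)) to its quoted form, caching
    # each answer the first time it is requested.
    def __init__(self, safe=b''):
        super().__init__()
        self.safe = _ALWAYS_SAFE | set(safe)

    def __missing__(self, b):
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res
        return res

quoter = Quoter(b'/')
print(''.join(quoter[b] for b in 'ü'.encode('utf-8')))   # %C3%BC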




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

OK I implemented the defaultdict solution. I got curious so ran some
rough speed tests, using the following code.

import random, urllib.parse
for i in range(0, 10):
    str = ''.join(chr(random.randint(0, 0x10)) for _ in range(50))
    quoted = urllib.parse.quote(str)

Time to quote 100,000 random strings of 50 characters.
(Ran each test twice, worst case printed)

HEAD, chars in range(0,0x11): 1m44.80
HEAD, chars in range(0,256): 25.0s
patch9, chars in range(0,0x11): 35.3s
patch9, chars in range(0,256): 27.4s
New, chars in range(0,0x11): 31.4s
New, chars in range(0,256): 25.3s

Head is the current Py3k head. Patch 9 is my previous patch (before
implementing defaultdict), and New is after implementing defaultdict.

Interesting. Defaultdict didn't really make much of an improvement. You
can see the big help the cache itself makes, though (my code caches all
chars, whereas the HEAD just caches ASCII chars, which is why HEAD is so
slow on the full repertoire test). Other than that, differences are
fairly negligible.

However, I'll keep the defaultdict code, I quite like it, speedy or not
(it is slightly faster).




[issue3552] uuid - exception on uuid3/uuid5

2008-08-14 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

The test suite breaks on the Lib/test/test_uuid.py, as of r65661. This
is because uuid3 and uuid5 now raise exceptions.

TypeError: new() argument 1 must be bytes or read-only buffer, not bytearray

The problem is due to the changes in the way s# now expects a
read-only buffer in PyArg_ParseTupleAndKeywords. (Which was changed in
r65661).

A rundown of the problem:

Lib/uuid.py:553 (in uuid.uuid3):
hash = md5(namespace.bytes + bytes(name, "utf-8")).digest()

namespace.bytes is a bytearray, so the argument to md5 is a bytearray.

Modules/md5module.c:517 (in _md5.md5.new):
if (!PyArg_ParseTupleAndKeywords(args, kwdict, "|s#:new", kwlist,

Using s# now requires a read-only buffer, so this raises a TypeError.

The same goes for uuid5 (which calls _sha1.sha1, and has exactly the
same problem).

The commit log for r65661 suggests changing some s# into s* (which
allows readable buffers). I don't understand the ramifications here
(some problem with threading), and when I made that change, it seg
faulted, so I'll leave well enough alone. But for someone who knows more
what they're doing, that may be a more root-of-the-problem fix.

In the meantime, I propose this simple patch to fix uuid: I think
namespace.bytes should actually return a bytes, not a bytearray, so I'm
modifying it to return a bytes.
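
To illustrate what the change buys (a sketch using the public uuid/hashlib
APIs, assuming namespace.bytes returns an immutable bytes object as proposed):

import uuid, hashlib

ns = uuid.NAMESPACE_DNS
raw = ns.bytes                      # with the fix, an immutable bytes object
assert isinstance(raw, bytes)

# This is essentially the hashing step inside uuid.uuid3:
digest = hashlib.md5(raw + bytes('python.org', 'utf-8')).digest()
print(uuid.UUID(bytes=digest[:16], version=3) == uuid.uuid3(ns, 'python.org'))  # True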

Related issue:
http://bugs.python.org/issue3139

Patch for r65675.
Commit log:

Fixed TypeError raised by uuid.uuid3 and uuid.uuid5, which was caused by
passing a bytearray to the hash functions. namespace.bytes now returns a
bytes instead of a bytearray.

--
components: Library (Lib)
files: uuid.patch
keywords: patch
messages: 71129
nosy: mgiuca
severity: normal
status: open
title: uuid - exception on uuid3/uuid5
type: compile error
versions: Python 3.0
Added file: http://bugs.python.org/file0/uuid.patch




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).

Another update on the remaining outstanding issues:

Resolved issues since last time:

 Should unquote accept a bytes/bytearray as well as a str?
No. But see below.

 Lib/email/utils.py:
 Should encode_rfc2231 with charset=None accept strings with non-ASCII
 characters, and just encode them to UTF-8?
Implemented Antoine's fix (or 'ascii').

 Should quote accept safe characters outside the
 ASCII range (thereby potentially producing invalid URIs)?
No.

New issues:

unquote_to_bytes doesn't cope well with non-ASCII characters (currently
encodes as UTF-8 - not a lot we can do since this is a str-bytes
operation). However, we can allow it to accept a bytes as input (while
unquote does not), and it preserves the bytes precisely.
Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

I have *implemented* that suggestion - so unquote_to_bytes now accepts
either a bytes or str, while unquote accepts only a str. No changes need
to be made unless there is disagreement on that decision.

I also emailed Barry Warsaw about the email/utils.py patch (because we
weren't sure exactly what that code was doing). However, I'm sure that
this patch isn't breaking anything there, because I call unquote with
encoding=latin-1, which is the same behaviour as the current head.

That's all the issues I have left over in this patch.

Attaching patch 10 (for revision 65675).

Commit log for patch 10:

Fix for issue 3300.

urllib.parse.unquote:
  Added encoding and errors optional arguments, allowing the caller
  to determine the decoding of percent-encoded octets.
  As per RFC 3986, default is utf-8 (previously implicitly decoded
  as ISO-8859-1).
  Fixed a bug in which mixed-case hex digits (such as %aF) weren't
  being decoded at all.

urllib.parse.quote:
  Added encoding and errors optional arguments, allowing the
  caller to determine the encoding of non-ASCII characters
  before being percent-encoded.
  Default is utf-8 (previously characters in range(128, 256)
  were encoded as ISO-8859-1, and characters above that as UTF-8).
  Characters/bytes above 128 are no longer allowed to be safe.
  Now allows either bytes or strings.
  Optimised Quoter; now inherits defaultdict.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding=latin-1, to preserve existing behaviour (which the email
module is dependent upon).

Added file: http://bugs.python.org/file1/parse.py.patch10




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-14 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Antoine:
 I think if you move the line defining str out of the loop, relative
 timings should change quite a bit. Chances are that the random
 functions are not very fast, since they are written in pure Python.

Well I wanted to test throwing lots of different URIs to test the
caching behaviour. You're right though, probably a small % of the time
is spent on calling quote.

Oh well, the defaultdict implementation is in patch10 anyway :) It
cleans Quoter up somewhat, so it's a good thing anyway. Thanks for your
help.




[issue3552] uuid - exception on uuid3/uuid5

2008-08-14 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

So are you saying that if I had libopenssl (or whatever the name is)
installed and linked with Python, it would bypass the use of _md5 and
_sha1, and call the hash functions in libopenssl instead? And all the
buildbots _do_ have it linked?

That would indicate that the bots _aren't_ testing the code in _md5 and
_sha1 at all. Perhaps one should be made to?




[issue3557] Segfault in sha1

2008-08-14 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

Continuing the discussion from Issue 3552
(http://bugs.python.org/issue3552).

r65676 makes changes to Modules/md5module.c and Modules/sha1module.c, to
allow them to read mutable buffers.

There's a segfault in sha1module if given 0 arguments. eg:

>>> import _sha1
>>> _sha1.sha1()
Segmentation fault

Docs here suggest this should be OK:
http://docs.python.org/dev/3.0/library/hashlib.html

This crashes on the Lib/test/test_hmac.py test case, but apparently
(according to Martin on issue 3552) none of the build bots see it
because they use libopenssl and completely bypass the _md5 and _sha1
modules. Also there are no direct test cases for either of these modules.

This is because new code in r65676 doesn't initialise a pointer to NULL.
Fixed in patch (as well as replacing tabs with spaces for consistency, in
both modules).

I strongly recommend that a) A build bot be made to use _md5 and _sha1
instead of OpenSSL (or they aren't running that code at all), AND/OR b)
Direct test cases be written for _md5 and _sha1.
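
Something as small as this would exercise those code paths directly (a sketch
only, assuming the fallback modules are present on the build at hand; these
are not the actual committed tests):

import _md5, _sha1

# The no-argument constructor must not crash and must hash the empty string.
assert _sha1.sha1().hexdigest() == _sha1.sha1(b'').hexdigest()
# Known-answer test from RFC 1321.
assert _md5.md5(b'abc').hexdigest() == '900150983cd24fb0d6963f7d28e17f72'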

Commit log:

Fixed crash on _sha1.sha1(), with no arguments, due to not initialising
pointer.

Normalised indentation in md5module.c and sha1module.c.

--
components: Interpreter Core
files: sha1.patch
keywords: patch
messages: 71157
nosy: mgiuca
severity: normal
status: open
title: Segfault in sha1
type: crash
versions: Python 3.0
Added file: http://bugs.python.org/file8/sha1.patch




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 I have no strong opinion on the very remaining points you listed,
 except that IMHO encode_rfc2231 with charset=None should not try to
 use UTF8 by default. But someone with more mail protocol skills
 should comment :)

OK I've come to the realization that DEMANDING ascii (and erroring on
non-ASCII chars) is better for the short term anyway, because we can
always decide later to relax the restrictions, but it's a lot worse to
add restrictions later. So I agree now, should be ASCII. And no, I don't
have mail protocol skills.

The same goes for unquote accepting bytes. We can decide to make it
accept bytes later, but can't remove that feature later, so it's best
(IMHO) to let it NOT accept bytes (which is the current behaviour).

 The bytes > 127 would be translated as themselves; this follows
 logically from how stuff is parsed -- %% and %FF are translated,
 everything else is not. But I don't really care, I doubt there's a
 need.

Ah but what about unquote (to string)? If it accepted bytes then it
would be a bytes-str operation, and then you need a policy on DEcoding
those bytes. It makes things too complex I think.

 I believe patch 9 still has errors defaulting to strict for quote().
 Weren't you going to change that?

I raised it as a concern, but I thought you overruled on that, so I left
it as errors='strict'. What do you want it to be? 'replace'? Now that
this issue has been fully discussed, I'm happy with whatever you decide.

 From looking at it briefly I
 worry that the implementation is pretty slow -- a method call for each
 character and a map() call sounds pretty bad.

Yes, it does sound pretty bad. However, that's the current way of doing
things in both 2.x and 3.x; I didn't change it (though it looks like I
changed a LOT, I really did try to change as little as possible!)
Assuming it wasn't made _slower_ than before, can we ignore existing
performance issues and treat them as a separate matter (and can be dealt
with after 3.0)?

I'm not putting up a new patch now. The only fix I'd make is to add
Antoine's or 'ascii' to email/utils.py, as suggested on the review
tracker. I'll make this change along with any other recommendations
after your review.

(That is Lib/email/utils.py line 222 becomes:
s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
)

btw this Rietveld is amazing. I'm assuming I don't have permission to
upload patches there (can't find any button to do so) which is why I
keep posting them here and letting you upload to Rietveld ...




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-13 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 I'm OK with replace for unquote() ...
 For quote() I think strict is better 

There's just an odd inconsistency there, but it's only a tiny gotcha;
and I agree with all your other arguments. I'll change unquote back to
errors='replace'.

 This means we have a useful analogy:
 quote(s, e) == quote(s.encode(e)).

That's exactly true, yes.

 Now that you've spent so  much time with this patch, can't you think
 of a faster way of doing this?

Well firstly, you could replace Quoter (the class) with a quoter
function, which is nested inside quote. Would calling a nested function
be faster than a method call?

 I wonder if mapping a defaultdict wouldn't work.

That is a good idea. Then, the function (as I describe above) would be
just the inside of what currently is the except block, and that would be
the default_factory of the defaultdict. I think that should speed things up.

I'm very hazy about what is faster in the bytecode world of Python, and
wary of making a change and proclaiming this is faster! without doing
proper speed tests (which is why I think this optimisation could be
delayed until at least after the core interface changes are made). But
I'll have a go at that change tomorrow.

(I won't be able to work on this for up to 24 hours).




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Bill, this debate is getting snipy, and going nowhere. We could argue
about what is the pure and correct thing to do, but we have a
limited time frame here, so I suggest we just look at the important facts.

1. There is an overwhelming consensus (including from me) that a
str-bytes version is acceptable to have in the library (whether or not
it's the correct solution).
2. There is an overwhelming consensus (including from you) that a
str-str version is acceptable to have in the library (whether or not
it's the correct solution).
3. By default, the str-str version breaks much less code, so both of us
decided to use it by default.

To this end, both of our patches:

1. Have a str-bytes version available.
2. Have a str-str version available.
3. Have quote and unquote functions call the str-str version.

So it seems we have agreed on that. Therefore, there should be no more
arguing about which is more right.

So all your arguments seem to be essentially saying the str-bytes
methods work perfectly; I don't care about if the str-str methods are
correct or not. The fact that your string versions quote UTF-8 and
unquote Latin-1 shows just how un-seriously you take the str-str methods.

Well the fact is that a) a great many users do NOT SHARE your ideals and
will default to using quote and unquote rather than the bytes
functions, and b) all of the rest of the library uses quote and
unquote. So from a practical sense, how these methods behave is of the
utmost importance - they are more important than any new functions we
introduce at this point.

For example, the cgi.FieldStorage and the http.server modules will
implicitly call unquote and quote.

That means whether you, or I, or Guido, or The King Of The Internet
likes it or not, we have to have a most reasonable solution to the
problem of quoting and unquoting strings.

 Good thing we don't need to [handle unescaped non-ASCII characters in
 unquote]; URIs consist of ASCII characters.

Once again, practicality beats purity. I'd argue that it's a *good* (not
strictly required) idea to not mangle input unless we have to.

  * Question: How does unquote_bytes deal with unescaped characters?

 Not sure I understand this question...

I meant unescaped non-ASCII characters, as discussed above (eg.
unquote_bytes('\u0123')).

 Your test cases probably aren't testing things I feel it's necessary
 to test. I'm interested in having the old test cases for urllib
 pass, as well as providing the ability to unquote_to_bytes().

I'm sorry, but you're missing the point of test-driven development. If
you think there is a bug, you don't just fix it and say look, the old
test cases still pass! You write new FAILING test cases to demonstrate
the bug. Then you change the code to make the test cases pass. All your
test suite proves is that you're happy with things the way they are.

 Matt, your patch is not some God-given thing here.

No, I am merely suggesting that it's had a great deal more thought put
into it -- not just my thought, but all the other people in the past
month who've suggested different approaches and brought up discussion
points. Including yourself -- it was your suggestion in the first place
to have the str-bytes functions, which I agree are important.

  snip - Quote uses cache

 I see no real advantage there, except that it has a built-in
 memory leak. Just use a function.

Good point. Well the merits of using a cache are completely independent
from the behavioural aspects. I simply changed the existing code as
little as possible. Hence this patch will have the same performance
strengths/weaknesses as all previous versions, and the performance can
be tuned after 3.0 if necessary. (Not urgent).

On statistics about UTF-8 versus other encodings. Yes, I agree, there
are lots of URIs floating around out there, in many different encodings.
Unfortunately, we can't implicitly handle them all (and I'm talking once
more explicitly about the str-str transform here). We need to pick one
as the default. Whether Latin-1 is more popular than UTF-8 *for the time
being* is no good reason to pick Latin-1. It is called a legacy
encoding for a reason. It is being phased out and should NOT be
supported from here on in as the default encoding in a major web
programming language.

(Also there is no point in claiming to be Unicode compliant then
turning around and supporting a charset with 256 symbols by default).

Because Python's urllib will mostly be used in the context of building
web apps, it is up to the programmer to decide what encoding to use for
h(is|er) web app. For future apps, this should almost certainly be UTF-8
(if it isn't, the website won't be able to accept form input across all
characters, so isn't Unicode compliant anyway).

The problem you mention of browsers submitting URIs encoded based on the
charset is simply something we have to live with. A server will never be
able to deal with that unless the URIs are coming

[issue3300] urllib.quote and unquote - Unicode issues

2008-08-12 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

By the way, what is the current status of this bug? Is anybody waiting
on me to do anything? (Re: Patch 9)

To recap my previous list of outstanding issues raised by the review:

 Should unquote accept a bytes/bytearray as well as a str?
Currently, does not. I think it's meaningless to do so (and how to
handle >127 bytes, if so?)

 Lib/email/utils.py:
 Should encode_rfc2231 with charset=None accept strings with non-ASCII
 characters, and just encode them to UTF-8?
Currently does. Suggestion to restrict to ASCII on the review tracker;
simple fix.

 Should quote raise a TypeError if given a bytes with encoding/errors
 arguments? (Motivation: TypeError is what you usually raise if you
 supply too many args to a function).
Resolved. Raises TypeError.

 Lib/urllib/parse.py:
 (As discussed above) Should quote accept safe characters outside the
 ASCII range (thereby potentially producing invalid URIs)?
Resolved? Implemented, but too messy and not worth it just to produce
invalid URIs, so NOT in patch.

That's only two very minor yes/no issues remaining. Please comment.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Guido suggested that quote's safe parameter should allow any
character, not just ASCII range. I've implemented this now. It was a lot
messier than I imagined.

The problem is that in my older patches, both 's' and 'safe' are encoded
to bytes right away, and the rest of the process is just octet encoding
(matching each byte against the safe set to see whether or not to quote it).

The new implementation requires that you delay encoding both of these
till the iteration over the string, so you match each *character*
against the safe set, then encode it if it's not in 'safe'. Now the
problem is some encodings/errors produce bytes which are in the safe
range. For instance quote('\u6f22', encoding='latin-1',
errors='xmlcharrefreplace') should give %26%2328450%3B (which is
&#28450; encoded). To preserve this behaviour, you then have to check
each *byte* of the encoded character against a 'safe bytes' set. I
believe that will slow down the implementation considerably.
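
Concretely, the behaviour that the two-level check preserves is this (using
the encoding/errors interface added by this patch):

from urllib.parse import quote

print(quote('\u6f22'))                      # '%E6%BC%A2' (UTF-8, the default)
print(quote('\u6f22', encoding='latin-1',
            errors='xmlcharrefreplace'))    # '%26%2328450%3B', i.e. &#28450; quoted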

In summary, it requires two levels of encoding: first characters, then
bytes. You can see how messy it made my quote implementation - I've
attached the patch (parse.py.patch8+allsafe).

I don't think it's worth the extra code bloat and performance hit just
to implement a feature whose only use is producing invalid URIs (since
URIs are supposed to only have ASCII characters). Does anyone disagree,
and want this feature in?

Added file: http://bugs.python.org/file11092/parse.py.patch8+allsafe




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Made a bunch of requested changes (I've reverted the all safe patch
for now since it caused so much grief; see above).

* quote: Fixed encoding illegal % sequences (and lots of new test cases
to prove it).
* quote now throws a type error if s is bytes, and encoding or errors
supplied.
* A few minor documentation fixes.

Patch 9.
Commit log for patch8 should suffice.

Added file: http://bugs.python.org/file11093/parse.py.patch9




[issue3532] bytes.tohex method

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 Except, when we look at the context. This is bytes class
 method returns a bytes or bytearray object, decoding the given
 string object.

 Do we require an opposite in the bytes class method? Where will
 we use it?
No, tohex is not a class method (unlike fromhex). It's just a regular
method on the bytes object.

 No, it is not going away. str.encode('hex') is available to
 users when they seek it. They wont look for it under bytes type.

>>> 'hello'.encode('hex')
LookupError: unknown encoding: hex

This is deliberate, I'm pretty sure. encode/decode are for converting
to/from unicode strings and bytes. It never made sense to have hex in
there, which actually goes the other way. And it makes no sense to
encode a Unicode string as hex (since they aren't bytes). So it's good
that that went away.

I'm just saying it should have something equally accessible to replace it.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 Invalid user input? What if the query string comes from filling
 a form?
 For example if I search the word numéro in a latin1 Web site,
 I get the following URL:
 http://www.le-tigre.net/spip.php?page=recherche&recherche=num%E9ro

Yes, that is a concern. I suppose the idea should be that as the
programmer _you_ write the website, so you make it UTF-8 and you use our
defaults. Or you make it Latin-1, and you override our defaults (which
is tricky if you use cgi.FieldStorage, for example).

But anyway, how do you propose to handle that (other than the programmer
setting the correct default). With errors='replace', the above query
will result in num�ro, but with errors='strict', it will result in a
UnicodeDecodeError (which you could handle, if you remembered). As a
programmer I don't really want to handle that error every time I use
unquote or anything that calls unquote. I'd rather accept the
possibility of '�'s in my input.
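
For the record, this is what the two defaults look like side by side with the
patch's interface ('num%E9ro' being the Latin-1 percent-encoding of 'numéro'):

from urllib.parse import unquote

q = 'num%E9ro'
print(unquote(q, errors='replace'))     # 'num\ufffdro' -- mangled but usable
print(unquote(q, encoding='latin-1'))   # 'numéro' -- right, if you know the charset
try:
    unquote(q, errors='strict')
except UnicodeDecodeError as exc:
    print('strict raises:', exc)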

I'm not going to dig in my heels though, this time :) I just want to
make sure the consequences of this decision are known before we commit.




[issue3532] bytes.tohex method

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Oh, where's the information on those?

(A brief search of the peps and bug tracker shows nothing).




[issue3532] bytes.tohex method

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

OK thanks.

Well I still can't really see what transform/untransform are about. Is
it OK to keep this issue open (and listed as 3.1) until more information
becomes available on those methods?




[issue3532] bytes.tohex method

2008-08-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

So I assumed.

In that case, why is there a fromhex? (Was that put in there before
the notion of transform/untransform?) As I've been saying, it's weird to
have a fromhex but not a tohex.

Anyway, assuming we go to 3.1 and add transform/untransform, I suppose
fromhex will remain for backwards, and tohex will not be needed. So I
guess this issue is closed.




[issue3532] bytes.tohex method

2008-08-09 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

I haven't been able to find a way to encode a bytes object in
hexadecimal, where in Python 2.x I'd go str.encode('hex').

I recommend adding a bytes.tohex() method (in the same vein as the
existing bytes.fromhex class method).

I've attached a patch which adds this method to the bytes and bytearray
classes (in the C code). Also included documentation and test cases.

Style note: The bytesobject.c and bytearrayobject.c files are all over
the place in terms of tabs/spaces. I used tabs in bytesobject and spaces
in bytearrayobject, since those seemed to be the predominant styles in
either file.

Commit log:

Added tohex method to bytes and bytearray objects. Also added
documentation and test cases.

--
components: Interpreter Core
files: bytes.tohex.patch
keywords: patch
messages: 70932
nosy: mgiuca
severity: normal
status: open
title: bytes.tohex method
type: feature request
versions: Python 3.0
Added file: http://bugs.python.org/file11091/bytes.tohex.patch




[issue3532] bytes.tohex method

2008-08-09 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 I recommend to use binascii.hexlify.

Ah, see I did not know about this! Thanks for pointing it out.

* However, it is *very* obscure. I've been using Python for a year and I
didn't know about it.
* And, it requires importing binascii.
* And, it results in a bytes object, not a str. That's weird. (Perhaps
it would be good idea to change the functions in the binascii module to
output strings instead of bytes? Ostensibly it looks like this module
hasn't undergone py3kification).

Would it hurt to have the tohex method of the bytes object to perform
this task as well? It would be much nicer to use since it's a method of
the object rather than having to find out about and import and use some
function.

Also why have a bytes.fromhex method when you could use binascii.unhexlify?

(If it's better from a code standpoint, you could replace the code I
wrote with a call to binascii.unhexlify).
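
For completeness, this is the spelling that works right now, without any new
method (just the existing binascii and bytes APIs):

import binascii

data = b'\xde\xad\xbe\xef'
h = binascii.hexlify(data)          # b'deadbeef' -- note: returns bytes, not str
print(h.decode('ascii'))            # 'deadbeef'
print(bytes.fromhex('deadbeef'))    # b'\xde\xad\xbe\xef' -- the existing inverse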




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Bill, I had a look at your patch. I see you've decided to make
quote_as_string the default? In that case, I don't know why you had to
rewrite everything to implement the same basic behaviour as my patch.
(My latest few patches support bytes both ways). Anyway, I have a lot of
issues with your implementation.

* Why did you replace all of the existing machinery? Particularly the
way quote creates Quoter objects and stores them in a cache. I haven't
done any speed tests, but I assume that was all there for performance
reasons.

* The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or
bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI
as a str. This is the first time in the whole discussion we've
represented a URI as bytes, not a str. URIs are not byte sequences, they
are character sequences (see discussion below). I think only
quote_as_string is valid.

* The names unquote_as_* and quote_as_* are confusing. Use unquote_to_*
and quote_from_* to avoid ambiguity.

* Are unquote_as_string and unquote both part of your public interface?
That seems like unnecessary duplication.

* As Antoine pointed out above, it's too limiting for quote to force
UTF-8. Add a 'charset' parameter.

* Add an 'errors' parameter too, to give the caller control over how
strict to be.

* unquote and unquote_plus are missing 'charset' param, which should be
passed along to unquote_as_string.

* I do like the addition of a plus argument, as opposed to the
separate unquote_plus and quote_plus functions. I'd swap the arguments
to unquote around so charset is first and then plus, so you can write
unquote(mystring, 'utf-8') without using a keyword argument.

* In unquote: The raw_unicode_escape encoding makes no sense. It does
exactly the same thing as Latin-1, except it also looks for "\\uXXXX"
escape sequences in the string and converts them into a Unicode character. So your code
behaves like this:

>>> urllib.parse.unquote('%5Cu00fc')
'ü'
(Should output \u00fc)
>>> urllib.parse.unquote('%5Cu')
UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in
position 11-12: truncated \u
(Should output \u)

I suspect the email package (where you got the inspiration to use
'rawunicodeescape') has this same crazy problem, but that isn't my
concern today!

Aside from this weirdness, you're essentially defaulting unquote to
Latin-1. As I've said countless times, unquote needs to be the inverse
of quote, or you get this behaviour:

>>> urllib.parse.unquote(urllib.parse.quote('ü'))
'Ã¼'

Once again, I refer you to my favourite web server example.

import http.server
s = http.server.HTTPServer(('',8000),
    http.server.SimpleHTTPRequestHandler)
s.serve_forever()

Run this in a directory with a non-Latin-1 filename (eg. 漢字), and
you will get a 404 when you click on the file.

* One issue I worked very hard to solve is how to deal with unescaped
non-ASCII characters in unquote. Technically this is an invalid URI, so
I'm not sure how important it is, but it's nice to be able to assume the
unquote function won't mess with them. For example,
unquote_as_string(\u6f22%C3%BC, charset=latin-1) should give
\u6f22\u00fc (or at least it would be nice). Yours raises
UnicodeEncodeError: 'ascii' codec can't encode character. (I assume
this is a wanted property, given that the existing test suite tests that
unquote can handle ALL unescaped ASCII characters (that's what
escape_string in test_unquoting is for) - I merely extend this concept
to be able to handle all unescaped Unicode characters). Note that it's
impossible to allow such lenience if you implement unquote_as_string as
calling unquote_as_bytes and then decoding.

* Question: How does unquote_bytes deal with unescaped characters?
(Since this is a str-bytes transform, you need to encode them somehow).
I don't have a good answer for you here, which is one reason I think
it's wrong to treat a URI as an octet encoding. I treat them as UTF-8.
You treat them as ASCII. Since URIs are supposed to only contain ASCII,
the answers ASCII, Latin-1 and UTF-8 are all as good as each
other, but as I said above, I prefer to be lenient and allow non-ASCII
URIs as input.

* Needs a lot more test cases, and documentation for your changes. I
suggest you plug my new test cases for urllib in and see if you can make
your code pass all the things I test for (and if not, have a good reason).

In addition, a major problem I have is with this dangerous assumption
that RFC 3986 specifies a byte-str encoding. You keep stating
assumptions like this:

 Remember that the RFC for percent-encoding really takes
 bytes in, and produces bytes out.  The string-in and string-out
 versions are to support naive programming (what a nice way of
 putting it!).

You assume that my patch, the string version of quote/unquote, is a
hack in order to satisfy the naive souls who only want to deal with
strings, while your method is the pure and correct solution. This is
in no way

[issue3532] bytes.tohex method

2008-08-09 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

You did the 3.1 thing again! We can accept a new feature like this
before 3.0b3, can we not?

 Hmm. There are probably many modules that you haven't used yet.

Snap :)

Well, I didn't know about the community's preference for functions over
methods. You make a lot of good points.

I think the biggest problem I have is the existence of fromhex. It's
really strange/inconsistent to have a fromhex without a tohex.

Also I think a lot of people (like me, in my relative inexperience) are
going to be at a loss as to why .encode('hex') went away, and they'll
easily be able to find .tohex (by typing help(bytes), or just guessing),
while binascii.hexlify is sufficiently obscure that I had to ask.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 Bill's main concern is with a policy decision; I doubt he would
 object to using your code once that is resolved.

But his patch does the same basic operations as mine, just implemented
differently and with the heap of issues I outlined above. So it doesn't
have anything to do with the policy decision.

 The purpose of the quoting functions is to turn a string
 (representing the human-readable version) into bytes (that go
 over the wire).

Ah hang on, that's a misunderstanding. There is a two-step process involved.

Step 1. Translate character/byte string into an ASCII character string
by percent-encoding the characters/bytes. (If percent-encoding
characters, use an unspecified encoding).
Step 2. Serialize the ASCII character string into an octet sequence to
send it over the wire, using some unspecified encoding.

Step 1 is explained in detail throughout the RFC, particularly in
Section 1.2.1 Transcription (Percent-encoded octets may be used within
a URI to represent characters outside the range of the US-ASCII coded
character set) and 2.1 Percent Encoding.

Step 2 is not actually part of the spec (because the spec outlines URIs
as character sequences, not how to send them over a network). It is
briefly described in Section 2 (This specification does not mandate any
particular character encoding for mapping between URI characters and the
octets used to store or transmit those characters.  When a URI appears
in a protocol element, the character encoding is defined by that protocol).
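
A tiny worked example of the two steps (illustration only; the charset used in
step 1 is exactly the thing the RFC leaves open, here UTF-8):

s = 'numéro'                      # step 1 input: characters, possibly non-ASCII
step1 = 'num%C3%A9ro'             # percent-encoded form: still a character string
step2 = step1.encode('ascii')     # step 2: octets, chosen by the protocol on the wire
assert step2 == b'num%C3%A9ro'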

Section 1.2.1:

 A URI may be represented in a variety of ways; e.g., ink on
 paper, pixels on a screen, or a sequence of character
 encoding octets.  The interpretation of a URI depends only on
 the characters used and not on how those characters are
 represented in a network protocol.

The RFC then goes on to describe a scenario of writing a URI down on a
napkin, before stating:

 A URI is a sequence of characters that is not always represented
 as a sequence of octets.

Right, so there is no debate that a URI (after percent-encoding) is a
character string, not a byte string. The debate is only whether it's a
character or byte string before percent-encoding.

Therefore, the concept of quote_as_bytes is flawed.

 You feel wire-protocol bytes should be treated as
 strings, if only as bytestrings, because the libraries use them
 that way.

No I do not. URIs post-encoding are character strings, in the Unicode
sense of the term character. This entire topic has nothing to do with
the wire.

Note that the charset or encoding parameter in Bill/My patch
respectively isn't the mapping from URI strings to octets (that's
trivially ASCII). It's the charset used to encode character information
into octets which then get percent-encoded.

 The old code (and test cases) assumed Latin-1.

No, the old code and test cases were written for Python 2.x. They
assumed a byte string was being emitted (back when a byte string was a
string, so that was an acceptable output type). So they weren't assuming
an encoding. In fact the *ONLY* test case for Unicode in test_urllib
used a UTF-8-encoded string.

 r = urllib.parse.unquote('br%C3%BCckner_sapporo_20050930.doc')
 self.assertEqual(r, 'br\xc3\xbcckner_sapporo_20050930.doc')

In Python 2.x, this test case says unquote('%C3%BC') should give me the
byte sequence '\xc3\xbc', which is a valid case. In Python 3.0, the
code didn't change but the meaning subtly did. Now it says
unquote('%C3%BC') should give the string 'ü'. The name is clearly
supposed to be brückner, not brÃ¼ckner, which means in Python 3.0 we
should EITHER be expecting the BYTE string b'\xc3\xbc' or the character
string 'ü'.

So the old code and test cases didn't assume any encoding, then they
were accidentally made to assume Latin-1 by the fact that the language
changed underneath them.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-09 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

I've been thinking more about the errors=strict default. I think this
was Guido's suggestion. I've decided I'd rather stick with errors=replace.

I changed errors=replace to errors=strict in patch 8, but now I'm
worried that will cause problems, specifically for unquote. Once again,
all the code in the stdlib which calls unquote doesn't provide an errors
option, so the default will be the only choice when using these other
services.

I'm concerned that there'll be lots of unhandled exceptions flying
around for URLs which aren't encoded with UTF-8, and a conscientious
programmer will not be able to protect against user errors.

Take the cgi module as an example. Typical usage is to write:
 fields = cgi.FieldStorage()
 foo = fields.getFirst("foo")

If the QUERY_STRING is foo=w%FCt (Latin-1), with errors='strict', you
get a UnicodeDecodeError when you call cgi.FieldStorage(). With
errors='replace', the variable foo will be w�t. I think in general I'd
rather have '�'s in my program (representing invalid user input) than
exceptions, since this is usually a user input error, not a programming
error.

(One problem is that all I can do to handle this is catch a
UnicodeDecodeError on the call to FieldStorage; then I can't access any
of the data).

Now maybe something we can think about is propagating the encoding and
errors argument through to a few other major functions (such as
cgi.parse_qsl, cgi.FieldStorage and urllib.parse.urlencode), but that
should be separately to this patch.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Dear GvR,

New code review comments by mgiuca have been published.
Please go to http://codereview.appspot.com/2827 to read them.

Message:
Hi Guido,

Thanks very much for this very detailed review. I've replied to the
comments. I will make the changes as described below and send a new
patch to the tracker.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

A reply to a point on GvR's review, I'd like to open for discussion.
This relates to whether or not quote's safe argument should allow
non-ASCII characters.

 Using errors='ignore' seems like a mistake -- it will hide errors. I 
also wonder why safe should be limited to ASCII though.

The reasoning is this: if we allow non-ASCII characters to be escaped,
then we allow quote to generate invalid URIs (URIs are only allowed to
have ASCII characters). It's one thing for unquote to accept such URIs,
but I think we shouldn't be producing them. Albeit, it only produces an
invalid URI if you explicitly request it. So I'm happy to make the
change to allow any character to be safe, but I'll let it go to
discussion first.




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

 The important is that the defaults are safe. If users want to override
 the defaults and produce potentially invalid URIs, there is no reason to
 discourage them.

OK I think that's a fairly valid argument. I'm about to head off so I'll
post the patch I have now, which fixes most of the other concerns. That
change will cause havoc to quote I think ;)




[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all of
these issues after some discussion.

Outstanding issues raised by the reviews:

Doc/library/urllib.parse.rst:
Should unquote accept a bytes/bytearray as well as a str?

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?

Lib/test/test_http_cookiejar.py:
Does RFC 2965 let me get away with changing the test case to expect
UTF-8? (I'm pretty sure it doesn't care what encoding is used).

Lib/test/test_urllib.py:
Should quote raise a TypeError if given a bytes with encoding/errors
arguments? (Motivation: TypeError is what you usually raise if you
supply too many args to a function).

Lib/urllib/parse.py:
(As discussed above) Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?

--

Commit log for patch8:

Fix for issue 3300.

urllib.parse.unquote: Added encoding and errors optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is utf-8 (previously implicitly decoded as
ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
%aF) weren't being decoded at all.

urllib.parse.quote: Added encoding and errors optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is utf-8 (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
safe. Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding=latin-1, to preserve existing behaviour (which the whole
email module is dependent upon).

Added file: http://bugs.python.org/file11069/parse.py.patch8

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___



[issue3300] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

I'm also attaching a metapatch - diff from patch 7 to patch 8. This is
to give a rough idea of what I changed since the review.

(Sorry - This is actually a diff between the two patches, so it's pretty
hard to read. It would have been nicer to diff the files themselves but
I'm not doing local commits so that's hard. Can one use the Bazaar
mirror for development, or is it too out-of-date?)

Added file: http://bugs.python.org/file11070/parse.py.metapatch8

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___



[issue3300] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

OK after a long discussion on the mailing list, Guido gave this the OK,
with the provision that there are str->bytes and bytes->str versions of
these functions as well. So I've written those.

http://mail.python.org/pipermail/python-dev/2008-July/081601.html

quote itself now accepts either a str or a bytes. quote_from_bytes is a
new function which is just an alias for quote. (Is this acceptable?)
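
For reference, the str/bytes handling as it ended up in urllib.parse looks
like this (note that in the released module quote_from_bytes only accepts
bytes, rather than being a strict alias):

>>> from urllib.parse import quote, quote_from_bytes
>>> quote('é')                    # str input, UTF-8-encoded by default
'%C3%A9'
>>> quote(b'\xc3\xa9')            # bytes input, percent-encoded as-is
'%C3%A9'
>>> quote_from_bytes(b'\xc3\xa9')
'%C3%A9'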

unquote is still str->str. I've added a totally separate function
unquote_to_bytes which is str->bytes.

Note there is a slight issue here: I didn't quite know what to do with
unescaped non-ASCII characters in the input to unquote_to_bytes - they
need to somehow be converted to bytes. I chose to encode them using
UTF-8, on the basis that they technically shouldn't be in a URI anyway.

Note that my new unquote doesn't have this problem; it's carefully
written to preserve the Unicode characters, even if they aren't
expressible in the given encoding (which explains some of the code bloat).

This makes unquote(s, encoding=e) necessarily more robust than
unquote_to_bytes(s).decode(e) in terms of unescaped non-ASCII characters
in the input.
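
To make the difference concrete, a sketch against the functions as they
shipped ('café%21' stands in for a URI containing an unescaped non-ASCII
character):

>>> from urllib.parse import unquote, unquote_to_bytes
>>> unquote('café%21', encoding='latin-1')   # only the %-escape is decoded
'café!'
>>> unquote_to_bytes('café%21').decode('latin-1')  # str is UTF-8-encoded first
'cafÃ©!'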

I've also added new test cases and documentation for these two new
functions (included in patch6).

On an entirely personal note, can whoever checks this in please mention
my name in the commit log - I've put in at least 30 hours researching
and writing this patch, and I'd like for this not to go uncredited :)

Commit log for patch6:

Fix for issue 3300.

urllib.parse.unquote: Added encoding and errors optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is utf-8 (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added encoding and errors optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is utf-8 (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
safe. Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding=latin-1, to preserve existing behaviour (which the whole
email module is dependent upon).

Added file: http://bugs.python.org/file11009/parse.py.patch6

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___



[issue3478] Documentation for struct module is out of date in 3.0

2008-07-31 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

The documentation for the struct module still uses the term "string",
even though the struct module itself deals entirely in bytes objects in
Python 3.0.

I propose updating the documentation to reflect the 3.0 terminology.

I've attached a patch for the Doc/library/struct.rst file. It mostly
renames "string" to "bytes". It also notes that pack for 'c', 's' and
'p' accepts either str or bytes, but unpack spits out a bytes.

One important point: If you pass a str to 'c', 's' or 'p', it will get
encoded with UTF-8 before being packed. I've described this behaviour in
the documentation. I'm not sure if this should be described as the
official behaviour, or just informatively.
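
As a quick sanity check of the bytes-oriented terminology (a minimal
example; note that the implicit str-to-UTF-8 encoding described above did
not survive into later 3.x releases, which require bytes for 'c', 's' and
'p'):

>>> import struct
>>> packed = struct.pack('3sB', b'abc', 255)
>>> packed
b'abc\xff'
>>> struct.unpack('3sB', packed)   # unpack always yields bytes for 's'
(b'abc', 255)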

I've traced this behaviour to Modules/_struct.c lines 607, 1650 and 1676
(for 'c', 's' and 'p' respectively), which calls
_PyUnicode_AsDefaultEncodedString. This is found in
Object/unicodeobject.c:1410, which directly calls PyUnicode_EncodeUTF8.

Hence the UTF-8 encoding is not system or locale specific - it will
always happen. However, perhaps we should loosen the documentation to
say that strings are encoded using a default encoding scheme.

It would be good if the authors of the struct module read over these
changes first, to make sure I am describing it correctly.

I have also updated Modules/_struct.c's doc strings and exception
messages to reflect this new terminology. (I've changed nothing besides
the contents of these strings - test case passes, just to be safe).

Patch is for /python/branches/py3k/, revision 65324.

Commit Log:

Doc/library/struct.rst: Updated documentation to Python 3.0 terminology
(bytes instead of strings). Added note that packing 'c', 's' or 'p'
accepts either str or bytes.

Modules/_struct.c: Updated doc strings and exception messages to the same.

--
assignee: georg.brandl
components: Documentation
files: struct-doc.patch
keywords: patch
messages: 70506
nosy: georg.brandl, mgiuca
severity: normal
status: open
title: Documentation for struct module is out of date in 3.0
versions: Python 3.0
Added file: http://bugs.python.org/file11013/struct-doc.patch

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3478
___



[issue3300] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's
minor, gives a harmless BytesWarning if you run with -b, which make
test does, so I only picked it up after submitting).

I've slightly changed the code in quote so it doesn't do that any more
(it normalises all safe arguments to bytes).

Please review patch 7, not 6. Same commit log as above.

(Also .. someone let me know if I'm not submitting patches properly,
like perhaps I should be deleting the old ones not keeping them around?)

Added file: http://bugs.python.org/file11015/parse.py.patch7

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___



[issue3478] Documentation for struct module is out of date in 3.0

2008-07-31 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Thanks for the props!

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3478
___



[issue3348] Cannot start wsgiref simple server in Py3k

2008-07-22 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Are you saying the stream passed to _write SHOULD always be a binary
stream, and hence the test case is wrong, because it opens a text stream?

(I'm not sure where the stream comes from, but we should guarantee it's
a binary stream).

Also, why Latin-1?

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3348
___



[issue3348] Cannot start wsgiref simple server in Py3k

2008-07-22 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Wow, I read the WSGI spec. It seems very strange that it says "HTTP
does not directly support Unicode, and neither does this interface".
Clearly HTTP *does* support Unicode, because it allows you to specify an
encoding.

I assume then that the ISO-8859-1 characters the WSGI functions receive
will be treated as byte values. (That's rather silly; it's just dodging
the issue of Unicode rather than supporting it).

But in any event, the PEP has spoken, so we stick with Latin-1.

With respect to the text/binary stream, I think it would be best if it's
a binary stream, and we explicitly convert those str objects (which WSGI
says must only contain Latin-1 range characters) into bytes objects
(simply treating code points as bytes; in other words calling
.encode('latin-1')) and writing them to the binary stream. (Since the
WSGI spec is so adamant we deal in bytes).
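
A minimal sketch of the conversion being proposed (write_wsgi_str is a
hypothetical helper, not the actual wsgiref code):

import io

def write_wsgi_str(binary_stream, data):
    # PEP 333 strings may only contain code points < 256, so latin-1 maps
    # each code point straight onto the corresponding byte value.
    binary_stream.write(data.encode('latin-1'))

buf = io.BytesIO()
write_wsgi_str(buf, 'Status: 200 OK\r\n')
print(buf.getvalue())   # b'Status: 200 OK\r\n'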

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3348
___



[issue3300] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

OK I spent a while writing test cases for quote and unquote, encoding and
decoding various Unicode strings with different encodings. As a result,
I found a bunch of issues in my previous patch, so I've rewritten the
patches to both quote and unquote. They're both actually more similar to
the original version now.

I'd be interested in hearing if anyone disagrees with my expected output
for these test cases.

I'm now confident I have good test coverage directly on the quote and
unquote functions. However, I haven't tested the other library functions
which depend upon them (though the entire test suite passes). Though as
I showed in that big post I made yesterday, other modules such as cgi
seem to be working fine (their behaviour has changed; they use UTF-8
now; but that's the whole point of this patch).

I still haven't figured out what the behaviour of safe should be in
quote. Should it only allow ASCII characters (thereby limiting the
output to an ASCII string, as specified by RFC 3986)? Should it also
allow Latin-1 characters, or all Unicode characters as well (perhaps
allowing you to create IRIs -- admittedly I don't know much about IRIs).
The new implementation of quote makes it rather difficult to allow
non-Latin-1 characters to be made safe, as it encodes the string into
bytes before any processing.

Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891.

Commit log:

urllib.parse.unquote: Added encoding and errors optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is utf-8 (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added encoding and errors optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is utf-8 (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters above 128 are no longer allowed to be safe.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py: Updated test case which expected output
in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding=latin-1, to preserve existing behaviour (which the whole
email module is dependent upon).

Added file: http://bugs.python.org/file10883/parse.py.patch4

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___



[issue3347] urllib.robotparser doesn't work in Py3k

2008-07-12 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

urllib.robotparser is broken in Python 3.0, due to a bytes object
appearing where a str is expected.

Example:

>>> import urllib.robotparser
>>> r = urllib.robotparser.RobotFileParser('http://www.python.org/robots.txt')
>>> r.read()
TypeError: expected an object with the buffer interface

This is because the variable f in RobotFileParser.read is opened by
urlopen as a binary file, so f.read() returns a bytes object.

I've included a patch, which checks if it's a bytes, and if so, decodes
it with 'utf-8'. A more thorough fix might figure out what the charset
of the document is (in f.headers['Content-Type']), but at least this
works, and will be sufficient in almost all cases.
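
Roughly the shape of the fix, as a standalone sketch (fetch_robots_lines
is a hypothetical helper; the real RobotFileParser.read() feeds the
decoded lines to its parse() method):

import urllib.request

def fetch_robots_lines(url):
    f = urllib.request.urlopen(url)
    raw = f.read()                 # bytes in Python 3
    f.close()
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')  # a fuller fix would inspect f.headers['Content-Type']
    return raw.splitlines()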

Also there are no test cases for urllib.robotparser.

Patch (robotparser.py.patch) is for branch /branches/py3k, revision 64891.

Commit log:

Lib/urllib/robotparser.py: Fixed robotparser for Python 3.0. urlopen
returns bytes objects where str expected. Decode the bytes using UTF-8.

--
components: Library (Lib)
files: robotparser.py.patch
keywords: patch
messages: 69586
nosy: mgiuca
severity: normal
status: open
title: urllib.robotparser doesn't work in Py3k
type: behavior
versions: Python 3.0
Added file: http://bugs.python.org/file10885/robotparser.py.patch

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3347
___



[issue3348] Cannot start wsgiref simple server in Py3k

2008-07-12 Thread Matt Giuca

New submission from Matt Giuca [EMAIL PROTECTED]:

The wsgiref simple server module has a demo server, which fails to
start in Python 3.0 for a bunch of reasons.

To verify this, just go into the Lib/wsgiref directory, and run:
python3.0 ./simple_server.py
(which launches the demo server).

This opens your web browser and points it at the server, and you get the
following error:

ValueError: need more than 1 value to unpack

I fixed a number of issues which simply killed the server:

* In get_environ, it did not iterate over the headers mapping properly
at all (was expecting a sequence of strings, it actually is a mapping).
I think the email.message.Message class changed. Fixed.
* In demo_app, it calls sort on the output of dict.items() - a list in
Python 2, but an iterator in Python 3, so it fails. Fixed (using sorted).

Unfortunately, the final issue is a bit harder to fix. It seems when I
run the demo server, it opens a binary stream, but handlers.py sends
strings to be written, giving the error

TypeError: send() argument 1 must be bytes or read-only buffer, not str

However in the test case, it opens a text stream, so handlers.py works fine.

The following *HACK* fixes it so the demo server works, but breaks the
test suite (it is NOT included in the attached patch):

--- Lib/wsgiref/handlers.py (revision 64895)
+++ Lib/wsgiref/handlers.py (working copy)
@@ -382,8 +382,8 @@
 self.environ.update(self.base_env)
 
 def _write(self,data):
-self.stdout.write(data)
-self._write = self.stdout.write
+self.stdout.write(data.encode('utf-8'))
+#self._write = self.stdout.write
 
I can't figure out right away what to do about this, but the best
solution would be to get the demo server to open the socket in text mode.

In any case, the patch is attached for branch /branches/py3k, revision
64895.

Commit log:

* Lib/wsgiref/simple_server.py: Fixed two fatal errors which prevent the
demo server from running (broken due to Python 3.0).
Note: Demo server may still not run due to an issue between strings and
bytes.

--
components: Library (Lib)
files: simple_server.py.patch
keywords: patch
messages: 69587
nosy: mgiuca
severity: normal
status: open
title: Cannot start wsgiref simple server in Py3k
type: behavior
versions: Python 3.0
Added file: http://bugs.python.org/file10886/simple_server.py.patch

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3348
___



[issue3300] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

So today I grepped for urllib in the entire library in an effort to
track down every dependency on quote and unquote to see exactly how my
patch breaks other code. I've now investigated every module in the
library which uses quote, unquote or urlencode, and my findings are
documented below in detail.

So far I have found no code breakage except for the original
email.utils issue I fixed in patch 2. Of course that doesn't mean the
behaviour hasn't changed. Nearly all modules in the report below have
changed their behaviour so they used to deal with Latin-1-encoded URLs
and now deal with UTF-8-encoded URLs. As discussed at length above, I
see this as a positive change, since nearly everybody encodes URLs in
UTF-8, and of course it allows for all characters.

I also point out that the http.server module (unpatched) is internally
broken when dealing with filenames with characters outside range(0,256);
my patch fixes it.

I'm attaching patch 5, which adds a bunch of new test cases to various
modules which demonstrate those modules correctly handling UTF-8-encoded
URLs. It also fixes a bug in email.utils which I introduced in patch 2.

Note that I haven't yet fully investigated urllib.request.

Aside from that, the only remaining matter is whether or not it's better
to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that
question doesn't need debate.

So basically I think if there's support for it, this patch is just about
ready to be accepted. I'm hoping it can be included in the 3.0b2 release
next week.

I'd be glad to hear any feedback about this proposal.

Not Yet Investigated
--------------------

./urllib/request.py
By far the biggest user of quote and unquote.
username, password, hostname and paths are now all converted
to/from UTF-8 percent-encodings.
Other concerns are:
* Data in the form application/x-www-form-urlencoded
* FTP access
I think this needs to be tested further.

Looks fine, not tested
----------------------

./xmlrpc/client.py
Just used to decode URI auth string (user:pass). This will change
to UTF-8, but is probably OK.
./logging/handlers.py
Just uses it in the HTTP handler to encode a dictionary. Probably
preferable to use UTF-8 to encode an arbitrary string.
./macurl2path.py
Calls to urllib look broken. Not tested.

Tested manually, fine
---------------------

./wsgiref/simple_server.py
Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.
./http/server.py
All uses are for translating between actual file-system paths to
URLs. This works fine for UTF-8 URLs. Note that since it uses
quote to create URLs in a dir listing, and unquote to handle
them, it breaks when unquote is not the inverse of quote.

Consider the following simple script:

import http.server
s = http.server.HTTPServer(('',8000),
http.server.SimpleHTTPRequestHandler)
s.serve_forever()

This will kind of work in the unpatched version, using
Latin-1 URLs, but filenames with characters above 256 will
break (give a 404 error).
The patch fixes this.
./urllib/robotparser.py
No test cases. Manually tested, URLs properly match when
percent-encoded in UTF-8.
./nturl2path.py
No test cases available. Manually tested, fine if URLs are
UTF-8 encoded.

Test cases either exist or added, fine
--------------------------------------

./test/test_urllib.py
I wrote a large wad of test cases for all the new functionality.
./wsgiref/util.py
Added test cases expecting UTF-8.
./http/cookiejar.py
I changed a test case to expect UTF-8.
./email/utils.py
I changed this file to behave as it used to, to satisfy its
existing test cases.
./cgi.py
Added test cases for UTF-8-encoded query strings.

Commit log:

urllib.parse.unquote: Added encoding and errors optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is utf-8 (previously implicitly decoded as
ISO-8859-1).

urllib.parse.quote: Added encoding and errors optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is utf-8 (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters above 128 are no longer allowed to be safe.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings. This includes
updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding=latin-1, to preserve existing behaviour (which

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-11 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

> 3.0b1 has been released, so no new features can be added to 3.0.

While my proposal is no doubt going to cause a lot of code breakage, I
hardly consider it a new feature. This is very definitely a bug. As I
understand it, the point of a code freeze is to stop the addition of
features which could be added to a later version. Realistically, there
is no way this issue can be fixed after 3.0 is released, as it
necessarily involves changing the behaviour of this function.

Perhaps I should explain further why this is a regression from Python
2.x and not a feature request. In Python 2.x, with byte strings, the
encoding is not an issue. quote and unquote simply encode bytes, and if
you want to use Unicode you have complete control. In Python 3.0, with
Unicode strings, if functions manipulate string objects, you don't have
control over the encoding unless the functions give you explicit
control. So Python 3.0's native Unicode strings have broken the library.

I give two examples.

Firstly, I believe that unquote(quote(x)) == x should hold for all
strings x. In Python 2.x, this is always trivially true (for non-Unicode
strings), because they simply encode and decode the octets. In Python
3.0, the two functions are inconsistent, and the round trip breaks
outside the range(0, 256).

>>> urllib.parse.unquote(urllib.parse.quote('ÿ'))  # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.

>>> urllib.parse.unquote(urllib.parse.quote('Ā'))  # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.

My patch succeeds for all characters.
>>> urllib.parse.unquote(urllib.parse.quote('Ā'))  # '\u0100'
'Ā'

Secondly, a bigger example, but I want to demonstrate how this bug
affects web applications, even very simple ones.

Consider this simple (beginnings of a) wiki system in Python 2.5, as a
CGI app:

#---
import cgi

fields = cgi.FieldStorage()
title = fields.getfirst('title')

print("Content-Type: text/html; charset=utf-8")
print()

print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
#---

(Place this in cgi-bin, navigate to it, and add the query string
"?title=Page Title"). I'll use the page titled "Mátt" as a test case.

If you navigate to "?title=Mátt", it displays the text "Debug:
'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox,
Safari and IE I have tested) encodes this as "?title=M%C3%A1tt". So this
is trivial, as it's just being unquoted into a raw byte string
'M\xc3\xa1tt', then written out again as a byte string.

Now consider that you want to manipulate it as a Unicode string, still
in Python 2.5. You could augment the program to decode it as UTF-8 and
then re-encode it. (I wrote a simple UTF-8 printing function which takes
Unicode strings as input).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")

print("Content-Type: text/html; charset=utf-8")
print()

print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
u'M\xe1tt'. Information about Mátt." Still working fine, and I can
manipulate it as Unicode because in Python 2.x I have direct control
over encoding/decoding.

Now let us upgrade this program to Python 3.0. (Note that I still can't
print Unicode characters directly out, because running through Apache
the stdout encoding is not UTF-8, so I use my printu8 function).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the encoding
# since it comes straight out of FieldStorage as a Unicode string.

print("Content-Type: text/html; charset=utf-8")
print()

print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug:
'M\xc3\xa1tt'. Information about Mátt." Once again, it is erroneously
(and implicitly) decoded as ISO-8859-1, so I end up with a meaningless
Unicode string. The only possible thing I can do about this as a web
developer is call title.encode('latin-1').decode('utf-8') - a dreadful hack.

With my patch applied, the input

[issue3300] urllib.quote and unquote - Unicode issues

2008-07-11 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

Since I got a complaint that my last reply was too long, I'll summarize it.

It's a bug report, not a feature request.

I can't get a simple web app to be properly Unicode-aware in Python 3,
which worked fine in Python 2. This cannot be put off until 3.1, as any
viable solution will break existing code.

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3300
___



[issue3330] webbrowser module doesn't correctly handle '|' character.

2008-07-10 Thread Matt Giuca

Matt Giuca [EMAIL PROTECTED] added the comment:

I was able to duplicate this, but it's an issue with Firefox, not
Python. webbrowser.open(url) just passes url as a command line argument
to the web browser; it doesn't do any manipulation.

Note that you get the exact same behaviour if you run Firefox from the
command line:

 firefox 'http://foo.com/bar.html?var=x|y|z'

Opens this URL in a new tab if it's already open, but splits on '|' and
opens in 3 separate tabs if Firefox isn't running.

Note also that while this string is a URL, it isn't properly normalized.

This works fine if you call

webbrowser.open("http://foo.com/bar.html?var=x%7Cy%7Cz")

(Which you can only obtain programmatically by generating the URL
properly in the first place, by using urllib.urlencode, or urllib.quote
on the value string "x|y|z").
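
For example, with the Python 3 spellings of those functions:

>>> import urllib.parse
>>> urllib.parse.quote('x|y|z')
'x%7Cy%7Cz'
>>> urllib.parse.urlencode({'var': 'x|y|z'})
'var=x%7Cy%7Cz'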

--
nosy: +mgiuca

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue3330
___


