Re: [Python-Dev] urllib unicode handling
Hi,

Jeroen Ruigrok van der Werven asmodai at in-nomine.org writes:

> Would people object if such functionality got added to urllib?

I would ;-) There are IRIs, just that nobody wrote a useful module for that. There are algorithms in the RFC that can convert URIs to IRIs and the other way round. IMO that's the way to go.

Regards,
Armin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] urllib unicode handling
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jeroen Ruigrok van der Werven
Sent: Wednesday, May 07, 2008 05:20
To: Tom Pinckney
Cc: python-dev@python.org
Subject: Re: [Python-Dev] urllib unicode handling

-On [20080507 04:06], Tom Pinckney ([EMAIL PROTECTED]) wrote:
> While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and Wikipedia seem to have embraced it (as have pretty much all other major web sites). As with HTML, there is what the standard says and what the actual browsers have to accept in order to work in the real world.

FYI, here is how we have patched urllib for use in EVE:

--- C:\p4\sdk\stackless25\Lib\urllib.py 2008-03-21 14:47:23.0 -
+++ C:\p4\eve\KALI\common\stdlib\urllib.py 2007-11-06 11:18:01.0 -
@@ -1158,12 +1158,29 @@
         except KeyError:
             res[i] = '%' + item
         except UnicodeDecodeError:
             res[i] = unichr(int(item[:2], 16)) + item[2:]
     return "".join(res)
 
+unquote_inner = unquote
+def unquote(s):
+    """CCP attempt at making sensible choices in unicode quoting / unquoting"""
+    s = unquote_inner(s)
+    try:
+        u = s.decode("utf-8")
+        try:
+            s2 = s.decode("ascii")
+        except UnicodeDecodeError:
+            s = u #yes, s was definitely utf8, which isn't pure ascii
+        else:
+            if u != s:
+                s = u
+    except UnicodeDecodeError:
+        pass #can't have been utf8
+    return s
+
 def unquote_plus(s):
     """unquote('%7e/abc+def') -> '~/abc def'"""
     s = s.replace('+', ' ')
     return unquote(s)
 
 always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@@ -1201,12 +1218,20 @@
         for i in range(256):
             c = chr(i)
             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
         _safemaps[cachekey] = safe_map
     res = map(safe_map.__getitem__, s)
     return ''.join(res)
+
+quote_inner = quote
+def quote(s, safe = '/'):
+    """CCP addition, to try to sensibly support / circumvent issues with unicode in urls"""
+    try:
+        return quote_inner(s, safe)
+    except KeyError:
+        return quote_inner(s.encode("utf-8"), safe)
 
 def quote_plus(s, safe = ''):
     """Quote the query fragment of a URL; replacing ' ' with '+'"""
     if ' ' in s:
         s = quote(s, safe + ' ')
         return s.replace(' ', '+')
Re: [Python-Dev] urllib unicode handling
Martin v. Löwis wrote:
> The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part.
>
> Code doing so just hasn't been contributed yet.

But if someone wanted to do so, it's pretty simple:

>>> u'www.\u212bngstr\xf6m.com'.encode("idna")
'www.xn--ngstrm-hua5l.com'

Robert Brewer
[EMAIL PROTECTED]
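
To make the split between the host part and the rest of the URL concrete, here is a minimal sketch of an IRI-to-URI conversion. It assumes modern Python 3 and `urllib.parse` rather than the Python 2.5 `urllib` under discussion, the helper name `iri_to_uri` is hypothetical, and it is a simplification of RFC 3987 section 3.1 (it drops any userinfo and does no normalization beyond what the built-in `idna` codec performs).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Hypothetical helper: convert an IRI string to an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host: apply IDNA so the name can be resolved by DNS.
    # Percent-encoded UTF-8 would not resolve.
    host = ""
    if parts.hostname:
        host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)

    # Everything else: encode as UTF-8, then percent-encode.
    # "%" stays in the safe set so existing escapes are not double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri("http://www.\u212bngstr\xf6m.com/caf\xe9?q=\xe9"))
```

A URL that is already plain ASCII passes through this sketch unchanged, which is one argument for treating IRI handling as a layer on top of the existing quoting functions.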
Re: [Python-Dev] urllib unicode handling
Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA encoding the hostname or replacing the non-ASCII characters with dashes? I guess in practice IDNA is the right decision.

Another part I wasn't clear on is whether urllib.quote() understands it's working on URIs, arbitrary strings, URLs or what. From the documentation, it looks like it expects to work only on the path component of URLs. If this is so, then it doesn't need to understand what to do if the IRI contains a hostname.

Seems like the other somewhat under-specified part of all of this is how urllib.unquote() should work. If after percent decoding it sees non-ASCII octets, should it try to decode them as UTF-8 and, if that fails, leave them as is?

On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

> Martin v. Löwis wrote:
>> The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part.
>>
>> Code doing so just hasn't been contributed yet.
>
> But if someone wanted to do so, it's pretty simple:
>
> >>> u'www.\u212bngstr\xf6m.com'.encode("idna")
> 'www.xn--ngstrm-hua5l.com'
>
> Robert Brewer
> [EMAIL PROTECTED]
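
The "try UTF-8, otherwise leave the octets alone" behaviour asked about here can be sketched as follows. This assumes Python 3's `urllib.parse.unquote_to_bytes`, not the Python 2.5 API in question, and the name `unquote_guess` is hypothetical. It is one possible reading of the question, not an agreed design.

```python
from urllib.parse import unquote_to_bytes


def unquote_guess(s):
    """Hypothetical helper: percent-decode, then decode as UTF-8 if possible."""
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: hand back the undecoded octets unchanged.
        return raw


print(repr(unquote_guess("Jos%C3%A9")))  # UTF-8 escapes decode to text
print(repr(unquote_guess("Jos%E9")))     # Latin-1 escape stays as bytes
```

The sketch returns text in one case and raw bytes in the other, which shows why the behaviour is hard to specify: callers have to cope with two result types.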
Re: [Python-Dev] urllib unicode handling
> Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA encoding the hostname or replacing the non-ascii characters with dashes? I guess in practice IDNA is the right decision.

I haven't fully understood it, either, but I think that's the right conclusion. People want to fetch the resource, then, and encoding the host name in UTF-8 won't do much good.

> Seems like the other somewhat under-specified part of all of this is how urllib.unquote() should work. If after percent decoding it sees non-ascii octets, should it try to decode them as utf-8 and if that fails then leave them as is?

That's why I think that using IRIs should be a separate feature, perhaps a separate module entirely.

Regards,
Martin
Re: [Python-Dev] urllib unicode handling
> If this is indeed the case, it sounds perfectly legal (according to the RFC) and perfectly practical (as required by numerous popular websites) to have urllib.quote and urllib.quote_plus do an automatic UTF-8 encoding of unicode strings before percent encoding them.

It's probably legal, but I don't understand why you think it's practical. The DNS lookup then will certainly fail, no?

Regards,
Martin
Re: [Python-Dev] urllib unicode handling
I was assuming urllib.quote/unquote would only be called on text intended to be used in non-hostname portions of the URIs. I'm not sure if this is the actual intent of urllib.quote, and perhaps the documentation should be updated to specify precisely what it does, so that people can decide which parts of URIs it is appropriate to quote/unquote. I don't believe quote/unquote does anything sensible with hostnames today that contain non-printable ASCII, so this is no loss of existing functionality.

Re your suggestion that IRIs should be a separate module: I guess my thought is that urllib out of the box should just work with the way websites on the web today actually work. Thus, we should make urllib do the UTF-8 encode / decode rather than make users switch to a different module for certain URLs and another library for other URLs.

Re the specific issue of how urllib.unquote should work: Perhaps there could be an optional second argument that specifies a content encoding to use when decoding escaped characters? I would propose that this parameter have a default value of UTF-8, since that is what most websites seem to do, but if the author knew that the website they were using encoded URLs in ISO-8859 then they could unquote using that scheme.

On May 7, 2008, at 3:10 PM, Martin v. Löwis wrote:

>> If this is indeed the case, it sounds perfectly legal (according to the RFC) and perfectly practical (as required by numerous popular websites) to have urllib.quote and urllib.quote_plus do an automatic UTF-8 encoding of unicode strings before percent encoding them.
>
> It's probably legal, but I don't understand why you think it's practical. The DNS lookup then will certainly fail, no?
>
> Regards,
> Martin
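
For comparison, an optional encoding argument of this kind exists on `unquote` in Python 3's `urllib.parse`, with UTF-8 as the default. The short sketch below uses that later API, which is an assumption relative to this thread (Python 2.5's `urllib.unquote` takes no such argument), and the variable names are illustrative only.

```python
from urllib.parse import unquote

# Default: escaped octets are decoded as UTF-8.
as_utf8 = unquote("Jos%C3%A9")

# A site known to escape ISO-8859-1 octets can be decoded explicitly.
as_latin1 = unquote("Jos%E9", encoding="iso-8859-1")

# Guessing wrong does not raise by default (errors="replace");
# the undecodable octet becomes U+FFFD.
wrong_guess = unquote("Jos%E9")

print(repr(as_utf8), repr(as_latin1), repr(wrong_guess))
```

The last case shows the cost of a default: the caller gets a replacement character rather than an error, so a wrong encoding guess can go unnoticed.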
[Python-Dev] urllib unicode handling
Hi,

While trying to use urllib in Python 2.5.1 to HTTP GET content from various web sites, I've run into a problem with urllib.quote (and .quote_plus): they don't accept unicode strings.

I see that this is an issue that has been discussed before; see this thread:

http://mail.python.org/pipermail/python-dev/2006-July/067248.html

especially this post:

http://mail.python.org/pipermail/python-dev/2006-July/067335.html

While I don't really want to re-open a can of worms, it seems that the current implementation of urllib.quote and urllib.quote_plus is painfully incompatible with how the web (circa 2008) actually works. While the standards may say there is no official way to represent unicode strings in URLs, in practice the world uses UTF-8 quite heavily.

For example, I found the following URLs in Google pretty quickly by looking for percent encoded utf-8 encoded accented e's.

http://www.last.fm/music/Jos%C3%A9+Gonz%C3%A1lez
http://en.wikipedia.org/wiki/Joseph_Fouch%C3%A9
http://apps.facebook.com/ilike/artist/Jos%C3%A9+Gonz%C3%A1lez/track/Stay+In+The+Shade?apv=1

While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and Wikipedia seem to have embraced it (as have pretty much all other major web sites). As with HTML, there is what the standard says and what the actual browsers have to accept in order to work in the real world.

urllib.urlencode already converts unicode characters to their UTF-8 representation before percent encoding them. Why not urllib.quote and urllib.quote_plus?

Thanks for any thoughts on this,

Tom
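
The behaviour requested here, UTF-8 encoding before percent encoding, can be illustrated with Python 3's `urllib.parse`, where `quote` and `quote_plus` encode text as UTF-8 by default. Using Python 3 is an assumption relative to this thread, which concerns Python 2.5, and the variable names are illustrative only.

```python
from urllib.parse import quote, quote_plus, urlencode

# Text is encoded as UTF-8, then each non-safe octet is percent-encoded.
artist = quote_plus("José González")
article = quote("Joseph_Fouché")
query = urlencode({"q": "José"})

print(artist)   # Jos%C3%A9+Gonz%C3%A1lez
print(article)  # Joseph_Fouch%C3%A9
print(query)    # q=Jos%C3%A9
```

The first two results match the path components of the Last.fm and Wikipedia URLs quoted above.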
Re: [Python-Dev] urllib unicode handling
> Thanks for any thoughts on this,

The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part.

Code doing so just hasn't been contributed yet.

Regards,
Martin
Re: [Python-Dev] urllib unicode handling
-On [20080507 04:06], Tom Pinckney ([EMAIL PROTECTED]) wrote:
> While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and Wikipedia seem to have embraced it (as have pretty much all other major web sites). As with HTML, there is what the standard says and what the actual browsers have to accept in order to work in the real world.

I agree with you. The dictionary project I am working on (Dutch Japanese) uses UTF-8 characters in its URLs, and things just worked with reasonably new browsers (at least no problems with Opera 9, Firefox 2 and 3, Internet Explorer 7 and Safari 3).

Then later Armin Ronacher warned me that you still have to URL-escape these things in order to not be in lala-land.

Would people object if such functionality got added to urllib?

-- 
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
If Winter comes, can Spring be far behind..?