Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Armin Ronacher
Hi,

Jeroen Ruigrok van der Werven asmodai at in-nomine.org writes:

> Would people object if such functionality got added to urllib?

I would ;-)  There are IRIs; it's just that nobody has written a useful module
for them yet.  There are algorithms in the RFC that convert URIs to IRIs and
the other way round.  IMO that's the way to go.

Regards,
Armin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Kristján Valur Jónsson
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf
> Of Jeroen Ruigrok van der Werven
> Sent: Wednesday, May 07, 2008 05:20
> To: Tom Pinckney
> Cc: python-dev@python.org
> Subject: Re: [Python-Dev] urllib unicode handling
>
> -On [20080507 04:06], Tom Pinckney ([EMAIL PROTECTED]) wrote:
> > While in theory UTF-8 is not a standard, sites like Last.fm, Facebook
> > and Wikipedia seem to have embraced it (as have pretty much all other
> > major web sites). As with HTML, there is what the standard says and
> > what the actual browsers have to accept in order to work in the real
> > world.


FYI, here is how we have patched urllib for use in EVE:

--- C:\p4\sdk\stackless25\Lib\urllib.py 2008-03-21 14:47:23.0 -
+++ C:\p4\eve\KALI\common\stdlib\urllib.py  2007-11-06 11:18:01.0 -
@@ -1158,12 +1158,29 @@
         except KeyError:
             res[i] = '%' + item
         except UnicodeDecodeError:
             res[i] = unichr(int(item[:2], 16)) + item[2:]
     return "".join(res)
 
+unquote_inner = unquote
+def unquote(s):
+    """CCP attempt at making sensible choices in unicode quoting / unquoting"""
+    s = unquote_inner(s)
+    try:
+        u = s.decode("utf-8")
+        try:
+            s2 = s.decode("ascii")
+        except UnicodeDecodeError:
+            s = u #yes, s was definitely utf8, which isn't pure ascii
+        else:
+            if u != s:
+                s = u
+    except UnicodeDecodeError:
+        pass  #can't have been utf8
+    return s
+
 def unquote_plus(s):
     """unquote('%7e/abc+def') -> '~/abc def'"""
     s = s.replace('+', ' ')
     return unquote(s)
 
 always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@@ -1201,12 +1218,20 @@
         for i in range(256):
             c = chr(i)
             safe_map[c] = (c in safe) and c or ('%%%02X' % i)
         _safemaps[cachekey] = safe_map
     res = map(safe_map.__getitem__, s)
     return ''.join(res)
+
+quote_inner = quote
+def quote(s, safe = '/'):
+    """CCP addition, to try to sensibly support / circumvent issues with unicode in urls"""
+    try:
+        return quote_inner(s, safe)
+    except KeyError:
+        return quote_inner(s.encode("utf-8"), safe)
 
 def quote_plus(s, safe = ''):
     """Quote the query fragment of a URL; replacing ' ' with '+'"""
     if ' ' in s:
         s = quote(s, safe + ' ')
         return s.replace(' ', '+')
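
For comparison, the same "try UTF-8, otherwise leave the octets alone" idea can be sketched for Python 3, where percent-decoding to bytes is its own step. This is only an illustration under that assumption, not the patch above: `unquote_guess` is a hypothetical name, and the only library call used is `urllib.parse.unquote_to_bytes`.

```python
from urllib.parse import unquote_to_bytes


def unquote_guess(s):
    """Percent-decode s and return text if the octets are valid UTF-8.

    Hypothetical helper (not part of urllib): when the decoded octets are
    not valid UTF-8, the raw bytes are returned unchanged so the caller
    can decide what to do with them.
    """
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw  # can't have been UTF-8


print(repr(unquote_guess("Jos%C3%A9")))  # UTF-8 octets, comes back as text
print(repr(unquote_guess("Jos%E9")))     # Latin-1 octet, comes back as bytes
```

Returning two different types is awkward for callers, which is part of why guessing inside unquote() is contentious; the sketch only makes the heuristic concrete.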


Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Robert Brewer
Martin v. Löwis wrote:
> The proper way to implement this would be IRIs (RFC 3987),
> in particular section 3.1. This is not as simple as just
> encoding it as UTF-8, as you might have to apply IDNA to
> the host part.
>
> Code doing so just hasn't been contributed yet.

But if someone wanted to do so, it's pretty simple:

>>> u'www.\u212bngstr\xf6m.com'.encode("idna")
'www.xn--ngstrm-hua5l.com'
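
A minimal sketch of the same host-name step, assuming Python 3 (where `str` is already Unicode and the built-in `idna` codec returns bytes). The round trip back through the codec is added here for illustration; note that the ANGSTROM SIGN comes back as a lowercase "å" because the codec normalizes the label before encoding.

```python
# Host name from the example above; \u212b is ANGSTROM SIGN.
host = "www.\u212bngstr\xf6m.com"

# The built-in "idna" codec applies nameprep and Punycode label by label.
ascii_host = host.encode("idna")
print(ascii_host)

# Decoding gives the normalized Unicode form, not the original spelling.
print(ascii_host.decode("idna"))
```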


Robert Brewer
[EMAIL PROTECTED]



Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
Maybe I didn't understand the RFC quite right, but it seemed like how  
to handle hostnames was left as a choice between IDNA encoding the  
hostname or replacing the non-ascii characters with dashes? I guess in  
practice IDNA is the right decision.


Another part I wasn't clear on is whether urllib.quote() understands
that it's working on URIs, arbitrary strings, URLs or what. From the
documentation, it looks like it expects to work only on the path
component of URLs. If so, then it doesn't need to understand what to do
if the IRI contains a hostname.
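
A small illustration of why quote() only makes sense on a path component, assuming Python 3's `urllib.parse.quote` (which keeps the same default `safe='/'`). The example.com URL is made up for the purpose.

```python
from urllib.parse import quote

# Quoting a whole URL escapes the ':' after the scheme, which breaks it.
whole = quote("http://example.com/a b")
print(whole)

# Quoting only the path component leaves scheme and host alone.
path_only = "http://example.com" + quote("/a b")
print(path_only)
```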


Seems like the other somewhat under-specified part of all of this is  
how urllib.unquote() should work. If after percent decoding it sees  
non-ascii octets, should it try to decode them as utf-8 and if that  
fails then leave them as is?


On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

> Martin v. Löwis wrote:
>> The proper way to implement this would be IRIs (RFC 3987),
>> in particular section 3.1. This is not as simple as just
>> encoding it as UTF-8, as you might have to apply IDNA to
>> the host part.
>>
>> Code doing so just hasn't been contributed yet.
>
> But if someone wanted to do so, it's pretty simple:
>
> >>> u'www.\u212bngstr\xf6m.com'.encode("idna")
> 'www.xn--ngstrm-hua5l.com'
>
> Robert Brewer
> [EMAIL PROTECTED]



Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Martin v. Löwis
> Maybe I didn't understand the RFC quite right, but it seemed like how to
> handle hostnames was left as a choice between IDNA encoding the hostname
> or replacing the non-ascii characters with dashes? I guess in practice
> IDNA is the right decision.

I haven't fully understood it, either, but I think that's the right
conclusion. People want to fetch the resource, then, and encoding the
host name in UTF-8 won't do much good.

> Seems like the other somewhat under-specified part of all of this is how
> urllib.unquote() should work. If after percent decoding it sees
> non-ascii octets, should it try to decode them as utf-8 and if that
> fails then leave them as is?

That's why I think that using IRIs should be a separate feature,
perhaps a separate module entirely.

Regards,
Martin


Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Martin v. Löwis
> If this is indeed the case, it sounds perfectly legal (according to the
> RFC) and perfectly practical (as required by numerous popular websites)
> to have urllib.quote and urllib.quote_plus do an automatic UTF-8
> encoding of unicode strings before percent encoding them.

It's probably legal, but I don't understand why you think it's
practical. The DNS lookup then will certainly fail, no?

Regards,
Martin


Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
I was assuming urllib.quote/unquote would only be called on text
intended to be used in non-hostname portions of the URIs. I'm not sure
if this is the actual intent of urllib.quote, and perhaps the
documentation should be updated to specify precisely what it does, so
that people can decide which parts of URIs it is appropriate to
quote/unquote. I don't believe quote/unquote does anything sensible
today with hostnames that contain non-ASCII characters, so this is no
loss of existing functionality.


Re your suggestion that IRIs should be a separate module: I guess my  
thought is that urllib out of the box should just work with the way  
websites on the web today actually work. Thus, we should make urllib  
do the utf-8 encode / decode rather than make users switch to a  
different module for certain URLs and another library for other URLs.


Re the specific issue of how urllib.unquote should work: Perhaps there
could be an optional second argument that specifies the character
encoding to use when decoding escaped characters? I would propose that
this parameter have a default value of utf-8 since that is what most
websites seem to do, but if the author knew that the website they were
using encoded URLs in an iso-8859 encoding then they could unquote
using that encoding.
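
As a sketch of what such a parameter looks like in practice: Python 3's `urllib.parse.unquote` takes `encoding` and `errors` keyword arguments, defaulting to `'utf-8'` and `'replace'`. The example below assumes Python 3 and uses a made-up Latin-1 escape (`%E9`) alongside the UTF-8 one.

```python
from urllib.parse import unquote

# Default: escaped octets are decoded as UTF-8.
utf8_name = unquote("Jos%C3%A9")
print(utf8_name)

# Caller knows the site uses Latin-1, so it says so explicitly.
latin1_name = unquote("Jos%E9", encoding="iso-8859-1")
print(latin1_name)

# Same input with the default encoding: the stray octet cannot be decoded
# as UTF-8 and is replaced with U+FFFD instead of raising.
wrong_guess = unquote("Jos%E9")
print(repr(wrong_guess))
```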


On May 7, 2008, at 3:10 PM, Martin v. Löwis wrote:

>> If this is indeed the case, it sounds perfectly legal (according to the
>> RFC) and perfectly practical (as required by numerous popular websites)
>> to have urllib.quote and urllib.quote_plus do an automatic UTF-8
>> encoding of unicode strings before percent encoding them.
>
> It's probably legal, but I don't understand why you think it's
> practical. The DNS lookup then will certainly fail, no?
>
> Regards,
> Martin




[Python-Dev] urllib unicode handling

2008-05-06 Thread Tom Pinckney

Hi,

While trying to use urllib in python 2.5.1 to HTTP GET content from  
various web sites, I've run into a problem with urllib.quote  
(and .quote_plus): they don't accept unicode strings.


I see that this is an issue that has been discussed before:

see this thread: 
http://mail.python.org/pipermail/python-dev/2006-July/067248.html
especially this post: 
http://mail.python.org/pipermail/python-dev/2006-July/067335.html

While I don't really want to re-open a can of worms, it seems that the  
current implementation of urllib.quote and urllib.quote_plus is  
painfully incompatible with how the web (circa 2008) actually works.  
While the standards may say there is no official way to represent  
unicode strings in URLs, in practice the world uses UTF-8 quite  
heavily. For example, I found the following URLs in Google pretty  
quickly by looking for percent encoded utf-8 encoded accented e's.


http://www.last.fm/music/Jos%C3%A9+Gonz%C3%A1lez
http://en.wikipedia.org/wiki/Joseph_Fouch%C3%A9

http://apps.facebook.com/ilike/artist/Jos%C3%A9+Gonz%C3%A1lez/track/Stay+In+The+Shade?apv=1

While in theory UTF-8 is not a standard, sites like Last.fm, Facebook  
and Wikipedia seem to have embraced it (as have pretty much all other  
major web sites). As with HTML, there is what the standard says and  
what the actual browsers have to accept in order to work in the real  
world.


urllib.urlencode already converts unicode characters to their UTF-8  
representation before percent encoding them. Why not urllib.quote and  
urllib.quote_plus?
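
To make the request concrete, here is a minimal sketch of the behaviour being asked for, assuming Python 3, where `urllib.parse.quote` and `quote_plus` accept text and encode it as UTF-8 by default before percent-encoding. The inputs are the names from the Last.fm and Wikipedia URLs above.

```python
from urllib.parse import quote, quote_plus

# Text is encoded as UTF-8 first, then each non-safe octet is escaped.
artist = quote_plus("Jos\u00e9 Gonz\u00e1lez")
print(artist)

article = quote("Joseph_Fouch\u00e9")
print(article)

# Doing the UTF-8 step by hand gives the same result.
by_hand = quote_plus("Jos\u00e9 Gonz\u00e1lez".encode("utf-8"))
print(by_hand == artist)
```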


Thanks for any thoughts on this,

Tom


Re: [Python-Dev] urllib unicode handling

2008-05-06 Thread Martin v. Löwis
> Thanks for any thoughts on this,

The proper way to implement this would be IRIs (RFC 3987),
in particular section 3.1. This is not as simple as just
encoding it as UTF-8, as you might have to apply IDNA to
the host part.

Code doing so just hasn't been contributed yet.
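
A rough sketch of what such code could look like, assuming Python 3 and only standard-library calls (`urlsplit`, `urlunsplit`, `quote`, and the `idna` codec). `iri_to_uri` and `_SAFE` are hypothetical names, and this is a simplification rather than a faithful implementation of RFC 3987 section 3.1: it drops any userinfo and does no extra normalization.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left alone when escaping; '%' is kept so that existing
# percent-escapes are not escaped a second time. (Illustrative choice.)
_SAFE = "/%:@!$&'()*+,;=~"


def iri_to_uri(iri):
    """Map an IRI to a URI: IDNA for the host, UTF-8 escapes elsewhere."""
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://www.\u212bngstr\xf6m.com/wiki/Joseph_Fouch\xe9"))
```

The host and the rest of the IRI go through different encodings, which is the reason a plain "encode everything as UTF-8" rule is not enough.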

Regards,
Martin


Re: [Python-Dev] urllib unicode handling

2008-05-06 Thread Jeroen Ruigrok van der Werven
-On [20080507 04:06], Tom Pinckney ([EMAIL PROTECTED]) wrote:
> While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and
> Wikipedia seem to have embraced it (as have pretty much all other major web
> sites). As with HTML, there is what the standard says and what the actual
> browsers have to accept in order to work in the real world.

I agree with you. The dictionary project I am working on (Dutch <-> Japanese)
uses UTF-8 characters in its URLs and things just worked with reasonably new
browsers (at least no problems with Opera 9, Firefox 2 and 3, Internet
Explorer 7 and Safari 3). Later, Armin Ronacher warned me that you still
have to URL-escape these things in order to not be in lala-land.

Would people object if such functionality got added to urllib?

-- 
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
If Winter comes, can Spring be far behind..?