Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Armin Ronacher
Hi, Jeroen Ruigrok van der Werven <asmodai at in-nomine.org> writes: "Would people object if such functionality got added to urllib?" I would ;-) There are IRIs, just that nobody wrote a useful module for that. There are algorithms in the RFC that can convert URIs to IRIs and the other way round.

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Kristján Valur Jónsson
-----Original Message----- From: [EMAIL PROTECTED] On Behalf Of Jeroen Ruigrok van der Werven Sent: Wednesday, May 07, 2008 05:20 To: Tom Pinckney Cc: python-dev@python.org Subject: Re: [Python-Dev] urllib unicode handling -On [20080507 04:06], Tom Pinckney

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Robert Brewer
Martin v. Löwis wrote: "The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part. Code doing so just hasn't been contributed yet." But if someone wanted to do so,

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA encoding the hostname or replacing the non-ascii characters with dashes? I guess in practice IDNA is the right decision. Another part I wasn't clear on is whether

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Martin v. Löwis
"Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA encoding the hostname or replacing the non-ascii characters with dashes? I guess in practice IDNA is the right decision." I haven't fully understood it, either, but I

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Martin v. Löwis
"If this is indeed the case, it sounds perfectly legal (according to the RFC) and perfectly practical (as required by numerous popular websites) to have urllib.quote and urllib.quote_plus do an automatic UTF-8 encoding of unicode strings before percent encoding them." It's probably legal, but I
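The behaviour proposed in the quoted text (encode as UTF-8 first, then percent-encode the resulting bytes) can be sketched as follows. This is only an illustration, not code from the thread: the thread concerns urllib.quote in Python 2.5.1, while the sketch assumes Python 3, where the function lives in urllib.parse. The helper name quote_utf8 is hypothetical.

```python
from urllib.parse import quote, unquote


def quote_utf8(text, safe="/"):
    """Hypothetical helper: UTF-8 encode text, then percent-encode the bytes."""
    # Encoding explicitly makes the two steps visible; the bytes outside the
    # unreserved ASCII set (and outside `safe`) become %XX escapes.
    return quote(text.encode("utf-8"), safe=safe)


encoded = quote_utf8("Motörhead")
print(encoded)           # Mot%C3%B6rhead
print(unquote(encoded))  # Motörhead (unquote decodes as UTF-8 by default)
```

Such a helper is only meant for path or query components; as the replies in this thread point out, the host part of a URL needs different treatment.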

Re: [Python-Dev] urllib unicode handling

2008-05-07 Thread Tom Pinckney
I was assuming urllib.quote/unquote would only be called on text intended to be used in non-hostname portions of the URIs. I'm not sure if this is the actual intent of urllib.quote and perhaps the documentation should be updated to specify what precisely it does and then people can decide

[Python-Dev] urllib unicode handling

2008-05-06 Thread Tom Pinckney
Hi, While trying to use urllib in Python 2.5.1 to HTTP GET content from various web sites, I've run into a problem with urllib.quote (and .quote_plus): they don't accept unicode strings. I see that this is an issue that has been discussed before: see this thread:

Re: [Python-Dev] urllib unicode handling

2008-05-06 Thread Martin v. Löwis
"Thanks for any thoughts on this," The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part. Code doing so just hasn't been contributed yet. Regards, Martin
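A rough sketch of the two-part treatment described above: IDNA for the host, UTF-8 plus percent-encoding for the rest. It is not the code the message says was missing, and it is not a full implementation of RFC 3987 section 3.1. It assumes Python 3 (urllib.parse and the built-in "idna" codec), ignores userinfo, and leaves existing percent-escapes untouched. The function name iri_to_uri and the example.* URLs are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Hypothetical helper: convert an IRI string to an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host part: apply IDNA (the built-in codec) instead of percent-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = "%s:%d" % (netloc, parts.port)

    # Other parts: UTF-8 encode, then percent-encode. "%" is kept safe so
    # that already-escaped sequences are not escaped a second time.
    path = quote(parts.path.encode("utf-8"), safe="/%")
    query = quote(parts.query.encode("utf-8"), safe="=&%+")
    fragment = quote(parts.fragment.encode("utf-8"), safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/straße"))
# http://xn--bcher-kva.example/stra%C3%9Fe
```

The host and the path are deliberately handled by different code paths, which is the point made in the message: encoding the whole string as UTF-8 would produce a percent-encoded hostname rather than an IDNA one.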

Re: [Python-Dev] urllib unicode handling

2008-05-06 Thread Jeroen Ruigrok van der Werven
-On [20080507 04:06], Tom Pinckney ([EMAIL PROTECTED]) wrote: "While in theory UTF-8 is not a standard, sites like Last.fm, Facebook and Wikipedia seem to have embraced it (as have pretty much all other major web sites)." As with HTML, there is what the standard says and what the actual browsers