Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-18 Thread Senthil Kumaran
On Fri, Mar 18, 2011 at 08:57:42PM -0400, Glyph Lefkowitz wrote:

> Well, by RFC 398*7* they're calling them IRIs instead.  'irilib', perhaps? ;-)

Yes, and it involves a huge number of Unicode character handling/parsing
rules in Resource Identifiers. 'irilib' sounds like a good plan.

-- 
Senthil
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-18 Thread Glyph Lefkowitz

On Mar 18, 2011, at 8:41 PM, Guido van Rossum wrote:

> Really. Do they still call them URIs? :-)

Well, by RFC 398*7* they're calling them IRIs instead.  'irilib', perhaps? ;-)



Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-18 Thread Guido van Rossum
On Fri, Mar 18, 2011 at 5:32 PM, Nick Coghlan  wrote:
> On Fri, Mar 18, 2011 at 1:38 PM, Guido van Rossum  wrote:
>>> But seriously, I think an additional function or an additional flag on
>>> the current functions in the parse module is sufficient, rather than
>>> going for another module.
>>
>> I vote for a new function, not a flag. (Others can explain my rule of
>> thumb against flag arguments whose values are nearly always
>> constants.)
>
> When I was last tinkering with this (i.e. when I wrote that proof of
> concept module for a fully RFC 3986 compliant parser), I actually
> replaced the "urljoin" name with "resolve_uriref".
>
> So a minimal change to provide at least RFC 3986 joining semantics
> would be to add a "resolve_uriref" that provides the RFC 3986 join
> semantics, while "urljoin" would continue to follow the older RFCs.

It's a bit long though -- users tend to flock to the shorter name.

> There are additional niceties in RFC 3986 that it would be good to
> provide, but that's when you start to get to the scale of a completely
> new URI parsing module.

Really. Do they still call them URIs? :-)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-18 Thread Nick Coghlan
On Fri, Mar 18, 2011 at 1:38 PM, Guido van Rossum  wrote:
>> But seriously, I think an additional function or an additional flag on
>> the current functions in the parse module is sufficient, rather than
>> going for another module.
>
> I vote for a new function, not a flag. (Others can explain my rule of
> thumb against flag arguments whose values are nearly always
> constants.)

When I was last tinkering with this (i.e. when I wrote that proof of
concept module for a fully RFC 3986 compliant parser), I actually
replaced the "urljoin" name with "resolve_uriref".

So a minimal change to provide at least RFC 3986 joining semantics
would be to add a "resolve_uriref" that provides the RFC 3986 join
semantics, while "urljoin" would continue to follow the older RFCs.
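
As a rough illustration of what such a function might look like, here is a
minimal sketch that follows the pseudocode in RFC 3986 sections 5.2.2 to 5.3
and the splitting regular expression from Appendix B. Only the name
"resolve_uriref" comes from this thread; the helper names split_uriref and
remove_dot_segments, and the code itself, are assumptions made for
illustration and are not taken from urllib.parse or from the proof of
concept module.

```python
import re

# Splitting regex from RFC 3986, Appendix B, rewritten with non-capturing
# groups so that only the five components are captured.
_URI_RE = re.compile(
    r'^(?:([^:/?#]+):)?'   # scheme
    r'(?://([^/?#]*))?'    # authority
    r'([^?#]*)'            # path (always present, may be empty)
    r'(?:\?([^#]*))?'      # query
    r'(?:#(.*))?$',        # fragment
    re.DOTALL)


def split_uriref(uriref):
    """Return (scheme, authority, path, query, fragment); None = undefined."""
    return _URI_RE.match(uriref).groups()


def remove_dot_segments(path):
    """Remove "." and ".." segments as described in RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith('../'):
            path = path[3:]
        elif path.startswith('./'):
            path = path[2:]
        elif path.startswith('/./'):
            path = path[2:]                 # "/./" becomes "/"
        elif path == '/.':
            path = '/'
        elif path.startswith('/../') or path == '/..':
            path = '/' + path[4:]           # "/../" or "/.." becomes "/"
            if output:
                output.pop()                # drop the preceding segment
        elif path in ('.', '..'):
            path = ''
        else:
            # Move the first segment (with its leading "/", if any) to output.
            end = path.find('/', 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return ''.join(output)


def resolve_uriref(base, ref):
    """Resolve ref against base (RFC 3986, sections 5.2.2, 5.2.3 and 5.3)."""
    bscheme, bauth, bpath, bquery, _ = split_uriref(base)
    scheme, auth, path, query, fragment = split_uriref(ref)
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = bscheme
        if auth is not None:
            path = remove_dot_segments(path)
        else:
            auth = bauth
            if path == '':
                path = bpath
                if query is None:
                    query = bquery
            elif path.startswith('/'):
                path = remove_dot_segments(path)
            else:
                # Merge with the base path (section 5.2.3).
                if bauth is not None and bpath == '':
                    merged = '/' + path
                else:
                    merged = bpath[:bpath.rfind('/') + 1] + path
                path = remove_dot_segments(merged)
    # Recompose the components (section 5.3).
    parts = []
    if scheme is not None:
        parts.append(scheme + ':')
    if auth is not None:
        parts.append('//' + auth)
    parts.append(path)
    if query is not None:
        parts.append('?' + query)
    if fragment is not None:
        parts.append('#' + fragment)
    return ''.join(parts)


print(resolve_uriref('http://a/b/c/d;p?q', '../../../g'))
print(resolve_uriref('foo://example.com/a/b', 'c'))
```

With the RFC's example base "http://a/b/c/d;p?q", this sketch resolves
"../../../g" to "http://a/g", and it resolves relative references for an
unknown scheme such as "foo" in the same way as for any other scheme.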

There are additional niceties in RFC 3986 that it would be good to
provide, but that's when you start to get to the scale of a completely
new URI parsing module.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-17 Thread Guido van Rossum
On Thu, Mar 17, 2011 at 8:19 PM, Senthil Kumaran  wrote:
> Nick Coghlan wrote:
>> > The problem is that it is quite a lot of work to get fully general URI
>> > parsing to work correctly, but the overlap with legacy URL parsing is
>> > large enough that many (most?) use cases in practice work just fine
>> > with the older RFC semantics.
>
> Yes. We can have an API which strictly conforms to the latest RFC by
> definition, but the problem is that there is code out there which 'expects'
> the parsing behavior to remain unchanged so that it does not break. And
> keeping the parsing behavior unchanged means conforming to the older RFC
> parsing rules.
>
> The solution seems to be an extra function, or a flag on the urlparse
> function, which exhibits the newer behavior.
>
> Guido wrote:
>
>> So would having two different API functions, one legacy and one
>> conforming, be a problem? Ideally the conforming API's name would not
>> be something lame like urllib2 but something timeless. :-)
>
> :-) Should blame Jeremy for that name! But urllib2 has long been replaced
> by urllib.parse, urllib.request and urllib.response. Considering how
> you remember urllib2, I think its name has stood the test of time.

It stood out like a sore thumb. :-)

> But seriously, I think an additional function or an additional flag on
> the current functions in the parse module is sufficient, rather than
> going for another module.

I vote for a new function, not a flag. (Others can explain my rule of
thumb against flag arguments whose values are nearly always
constants.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-17 Thread Senthil Kumaran
Nick Coghlan wrote:
> > The problem is that it is quite a lot of work to get fully general URI
> > parsing to work correctly, but the overlap with legacy URL parsing is
> > large enough that many (most?) use cases in practice work just fine
> > with the older RFC semantics.

Yes. We can have an API which strictly conforms to the latest RFC by
definition, but the problem is that there is code out there which 'expects'
the parsing behavior to remain unchanged so that it does not break. And
keeping the parsing behavior unchanged means conforming to the older RFC
parsing rules.

The solution seems to be an extra function, or a flag on the urlparse
function, which exhibits the newer behavior.

Guido wrote:

> So would having two different API functions, one legacy and one
> conforming, be a problem? Ideally the conforming API's name would not
> be something lame like urllib2 but something timeless. :-)

:-) Should blame Jeremy for that name! But urllib2 has long been replaced
by urllib.parse, urllib.request and urllib.response. Considering how
you remember urllib2, I think its name has stood the test of time.

But seriously, I think an additional function or an additional flag on
the current functions in the parse module is sufficient, rather than
going for another module.

-- 
Senthil


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-17 Thread Guido van Rossum
On Wed, Mar 16, 2011 at 5:02 AM, Nick Coghlan  wrote:
> On Tue, Mar 15, 2011 at 11:34 PM, Guido van Rossum  wrote:
>>
>> Can you be specific? What is different between those RFCs?
>
> I finally got around to trying to backport some of the additional
> urljoin tests from http://bugs.python.org/issue1500504 (specifically,
> the additional ones Mike Brown provided), but got tripped up by the
> behavioural changes between the earlier RFCs and RFC 3986 regarding
> the way ".." is handled.

Ah, got it.

> Even in test_urlparse, a bunch of the normative tests from RFC 3986
> are commented out because they fail (by design) when run through
> urllib.parse.urljoin. Some of the additional tests also fail because
> our urljoin implementation has a whitelist of schemes that support
> relative references, whereas 3986 expects relative references to work
> for unknown schemes as well.
>
> There are actually quite a few more terminology changes as well (as
> Senthil pointed out in his email), but it was specifically the failing
> test cases for urljoin semantics that bit me again yesterday.
>
> The problem is that it is quite a lot of work to get fully general URI
> parsing to work correctly, but the overlap with legacy URL parsing is
> large enough that many (most?) use cases in practice work just fine
> with the older RFC semantics.

So would having two different API functions, one legacy and one
conforming, be a problem? Ideally the conforming API's name would not
be something lame like urllib2 but something timeless. :-)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-16 Thread Nick Coghlan
On Tue, Mar 15, 2011 at 11:34 PM, Guido van Rossum  wrote:
>
> Can you be specific? What is different between those RFCs?

I finally got around to trying to backport some of the additional
urljoin tests from http://bugs.python.org/issue1500504 (specifically,
the additional ones Mike Brown provided), but got tripped up by the
behavioural changes between the earlier RFCs and RFC 3986 regarding
the way ".." is handled.

Even in test_urlparse, a bunch of the normative tests from RFC 3986
are commented out because they fail (by design) when run through
urllib.parse.urljoin. Some of the additional tests also fail because
our urljoin implementation has a whitelist of schemes that support
relative references, whereas 3986 expects relative references to work
for unknown schemes as well.
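
The scheme whitelist can be observed directly. The sketch below assumes a
Python 3 release in which urllib.parse still exposes the module-level list
uses_relative and consults it in urljoin; the example URLs and the scheme
'foo' are made up for illustration.

```python
from urllib.parse import urljoin, uses_relative

# 'http' is on the list of schemes allowed to take relative references;
# the made-up scheme 'foo' is not.
http_listed = 'http' in uses_relative
foo_listed = 'foo' in uses_relative

# Same base path and same relative reference, differing only in the scheme.
known = urljoin('http://example.com/a/b', 'c')
unknown = urljoin('foo://example.com/a/b', 'c')

print(http_listed, foo_listed)
print(known)
print(unknown)
```

Where that assumption holds, the reference is resolved against the base for
the listed scheme, while for the unlisted scheme urljoin hands the reference
back unchanged instead of resolving it as RFC 3986 expects.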

There are actually quite a few more terminology changes as well (as
Senthil pointed out in his email), but it was specifically the failing
test cases for urljoin semantics that bit me again yesterday.

The problem is that it is quite a lot of work to get fully general URI
parsing to work correctly, but the overlap with legacy URL parsing is
large enough that many (most?) use cases in practice work just fine
with the older RFC semantics.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-15 Thread Fred Drake
On Wed, Mar 16, 2011 at 12:03 AM, Senthil Kumaran  wrote:
> A new function that provides this behavior is also a good idea.

I'm strongly in favor of this approach.  I know we've been bitten by
changes made in the past, and have had to introduce Python-version
specific handling.  (I don't have the details handy, but vaguely
recall the two versions involved being 2.4 and 2.6.)


  -Fred

--
Fred L. Drake, Jr.    
"A storm broke loose in my mind."  --Albert Einstein


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-15 Thread Senthil Kumaran
Nick Coghlan wrote:
> 
> Backwards compatible with *what* though?

I meant the parsing 'behavior'.

> For the decimal module, we treat deviations from spec as bug fixes and
> update accordingly, even if this changes behaviour.
> 
> For URL parsing, the spec has changed (6 years ago!), but we still
> don't provide a spec-conformant implementation, even via a flag or new
> function.

If I understand correctly, by a spec-conformant implementation you mean
having the parsed components denoted by the same terminology (and
exhibiting the same behavior) as described in RFC 3986.

Take the example from the RFC:


   foo://example.com:8042/over/there?name=ferret#nose
   \_/   \______________/\_________/ \_________/ \__/
    |           |            |            |        |
 scheme     authority       path        query   fragment
    |   _____________________|__
   / \ /                        \
   urn:example:animal:ferret:nose

If I pass the same URLs through urlparse at the moment, I get:

>>> urlparse('foo://example.com:8042/over/there?name=ferret#nose')
ParseResult(scheme='foo', netloc='example.com:8042',
            path='/over/there?name=ferret#nose', params='', query='',
            fragment='')
>>> urlparse('urn:example:animal:ferret:nose')
ParseResult(scheme='urn', netloc='', path='example:animal:ferret:nose',
            params='', query='', fragment='')

The first result arises because we still have the "old" scheme-specific
parsing behavior: foo is an unrecognized scheme, so everything after the
netloc is classified as path. With a recognized scheme name, the parsing
behavior would match the expectation.

- A change to this would break compatibility with the older parsing behavior.

Another point to note is naming. We use 'netloc' loosely as a part name,
whereas 'authority' is the correct term, and the authority component has
sub-parts of its own.

- I think it would be good to change this and adopt the RFC terminology more
rigorously.
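
To make the naming difference concrete, here is a small sketch that relabels
the result of urllib.parse.urlsplit with the RFC 3986 component names. The
helper name rfc3986_view is hypothetical, and the sketch assumes a recent
Python 3 release, where urlsplit separates the query and fragment for any
scheme, so its output for the 'foo' URL differs from the session shown above.

```python
from urllib.parse import urlsplit


def rfc3986_view(url):
    """Hypothetical helper: relabel urlsplit() results with RFC 3986 names."""
    parts = urlsplit(url)
    return {
        'scheme': parts.scheme,
        'authority': parts.netloc,   # urllib.parse calls this 'netloc'
        'host': parts.hostname,      # sub-part of authority, None if absent
        'port': parts.port,          # sub-part of authority, None if absent
        'path': parts.path,
        'query': parts.query,
        'fragment': parts.fragment,
    }


foo_view = rfc3986_view('foo://example.com:8042/over/there?name=ferret#nose')
urn_view = rfc3986_view('urn:example:animal:ferret:nose')

for name, value in foo_view.items():
    print(name, '=', repr(value))
```

Note that urlsplit reports an absent authority as an empty string, so this
view cannot distinguish "undefined" from "empty" the way the RFC does.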

I am +1 on any helpful improvement we can make to this module. But we have
often seen that even the slightest change in parsing behavior causes harm and
brings us more bug reports.

A new function that provides this behavior is also a good idea.

-- 
Senthil


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-15 Thread Guido van Rossum
On Tue, Mar 15, 2011 at 7:58 PM, Nick Coghlan  wrote:
> On Tue, Mar 15, 2011 at 7:14 PM, Senthil Kumaran  wrote:
>> On Wed, Mar 16, 2011 at 7:01 AM, Nick Coghlan  wrote:
>>> With RFC 3986 passing its 6th birthday, and with it being well past
>>> its 7th by the time 3.3 comes out, can we finally switch to supporting
>>> the current semantics rather than the obsolete behaviour?
>>
>> We do in fact support RFC 3986, except for the cases where it
>> conflicts with the previous RFCs (IOW, backwards compatible).
>> The tests can give you a good picture here. Do you mean we should
>> just do away with backwards compatibility? Or did you have anything else
>> specifically in mind?
>
> Backwards compatible with *what* though?
>
> For the decimal module, we treat deviations from spec as bug fixes and
> update accordingly, even if this changes behaviour.
>
> For URL parsing, the spec has changed (6 years ago!), but we still
> don't provide a spec-conformant implementation, even via a flag or new
> function.

Can you be specific? What is different between those RFCs?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-15 Thread Nick Coghlan
On Tue, Mar 15, 2011 at 7:14 PM, Senthil Kumaran  wrote:
> On Wed, Mar 16, 2011 at 7:01 AM, Nick Coghlan  wrote:
>> With RFC 3986 passing its 6th birthday, and with it being well past
>> its 7th by the time 3.3 comes out, can we finally switch to supporting
>> the current semantics rather than the obsolete behaviour?
>
> We do in fact support RFC 3986, except for the cases where it
> conflicts with the previous RFCs (IOW, backwards compatible).
> The tests can give you a good picture here. Do you mean we should
> just do away with backwards compatibility? Or did you have anything else
> specifically in mind?

Backwards compatible with *what* though?

For the decimal module, we treat deviations from spec as bug fixes and
update accordingly, even if this changes behaviour.

For URL parsing, the spec has changed (6 years ago!), but we still
don't provide a spec-conformant implementation, even via a flag or new
function.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-15 Thread Senthil Kumaran
On Wed, Mar 16, 2011 at 7:01 AM, Nick Coghlan  wrote:
> With RFC 3986 passing its 6th birthday, and with it being well past
> its 7th by the time 3.3 comes out, can we finally switch to supporting
> the current semantics rather than the obsolete behaviour?

We do in fact support RFC 3986, except for the cases where it
conflicts with the previous RFCs (IOW, backwards compatible).
The tests can give you a good picture here. Do you mean we should
just do away with backwards compatibility? Or did you have anything else
specifically in mind?

-- 
Senthil


[Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

2011-03-15 Thread Nick Coghlan
For years, urlparse (and subsequently urllib.parse) has opted to
implement the semantics from the older URL processing RFCs, rather
than updating to the new semantics as the RFCs are superseded.

With RFC 3986 passing its 6th birthday, and with it being well past
its 7th by the time 3.3 comes out, can we finally switch to supporting
the current semantics rather than the obsolete behaviour?

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia