Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Fri, Mar 18, 2011 at 08:57:42PM -0400, Glyph Lefkowitz wrote:
> Well, by RFC 398*7* they're calling them IRIs instead. 'irilib', perhaps? ;-)

Yes, and it involves a whole lot of Unicode character handling/parsing rules in Resource Identifiers. 'irilib' sounds like a good plan.

--
Senthil
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Mar 18, 2011, at 8:41 PM, Guido van Rossum wrote:
> Really. Do they still call them URIs? :-)

Well, by RFC 398*7* they're calling them IRIs instead. 'irilib', perhaps? ;-)
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Fri, Mar 18, 2011 at 5:32 PM, Nick Coghlan wrote:
> On Fri, Mar 18, 2011 at 1:38 PM, Guido van Rossum wrote:
>>> But seriously, I think an additional function, or an additional flag in the current functions/methods in the parse module, is sufficient, rather than going for another module.
>>
>> I vote for a new function, not a flag. (Others can explain my rule of thumb against flag arguments whose values are nearly always constants.)
>
> When I was last tinkering with this (i.e. when I wrote that proof of concept module for a fully RFC 3986 compliant parser), I actually replaced the "urljoin" name with "resolve_uriref".
>
> So a minimal change to provide at least RFC 3986 joining semantics would be to add a "resolve_uriref" that provides the RFC 3986 join semantics, while "urljoin" would continue to follow the older RFCs.

It's a bit long though -- users tend to flock to the shorter name.

> There are additional niceties in RFC 3986 that it would be good to provide, but that's when you start to get to the scale of a completely new URI parsing module.

Really. Do they still call them URIs? :-)

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Fri, Mar 18, 2011 at 1:38 PM, Guido van Rossum wrote:
>> But seriously, I think an additional function, or an additional flag in the current functions/methods in the parse module, is sufficient, rather than going for another module.
>
> I vote for a new function, not a flag. (Others can explain my rule of thumb against flag arguments whose values are nearly always constants.)

When I was last tinkering with this (i.e. when I wrote that proof of concept module for a fully RFC 3986 compliant parser), I actually replaced the "urljoin" name with "resolve_uriref".

So a minimal change to provide at least RFC 3986 joining semantics would be to add a "resolve_uriref" that provides the RFC 3986 join semantics, while "urljoin" would continue to follow the older RFCs.

There are additional niceties in RFC 3986 that it would be good to provide, but that's when you start to get to the scale of a completely new URI parsing module.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
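For readers who want to see what "RFC 3986 join semantics" could look like in practice, here is a minimal sketch of a function using the name "resolve_uriref" from the message above. It is not Nick's proof-of-concept module and not anything in the standard library. It simply follows the reference-resolution pseudocode in RFC 3986 sections 5.2.2 to 5.3, with the component-splitting regex from Appendix B. The helper names (_split, _merge, remove_dot_segments) are illustrative assumptions.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B (groups renumbered).
# Undefined components come back as None; the path is always a string.
_URI_RE = re.compile(
    r"(?:([^:/?#]+):)?"   # scheme
    r"(?://([^/?#]*))?"   # authority
    r"([^?#]*)"           # path
    r"(?:\?([^#]*))?"     # query
    r"(?:#(.*))?",        # fragment
    re.DOTALL,
)


def _split(uri):
    """Return (scheme, authority, path, query, fragment)."""
    return _URI_RE.match(uri).groups()


def remove_dot_segments(path):
    """RFC 3986 section 5.2.4: drop "." and ".." segments from a path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            end = path.find("/", 1 if path.startswith("/") else 0)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)


def _merge(base_authority, base_path, ref_path):
    """RFC 3986 section 5.2.3: merge a relative path with the base path."""
    if base_authority is not None and not base_path:
        return "/" + ref_path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def resolve_uriref(base, ref):
    """Resolve ref against base per RFC 3986 section 5.2.2 (strict mode)."""
    b_scheme, b_auth, b_path, b_query, _ = _split(base)
    scheme, auth, path, query, fragment = _split(ref)
    if scheme is not None:
        path = remove_dot_segments(path)
    else:
        scheme = b_scheme
        if auth is not None:
            path = remove_dot_segments(path)
        else:
            auth = b_auth
            if not path:
                path = b_path
                if query is None:
                    query = b_query
            elif path.startswith("/"):
                path = remove_dot_segments(path)
            else:
                path = remove_dot_segments(_merge(b_auth, b_path, path))
    # Recomposition, RFC 3986 section 5.3.
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if auth is not None:
        result += "//" + auth
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

With the RFC's own base URI, resolve_uriref("http://a/b/c/d;p?q", "../../../g") gives "http://a/g", which is one of the "abnormal" examples where RFC 3986 differs from the older RFCs. The sketch does no validation or percent-encoding normalization, and it does not consult any per-scheme list.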
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Thu, Mar 17, 2011 at 8:19 PM, Senthil Kumaran wrote:
> Nick Coghlan wrote:
>> > The problem is that it is quite a lot of work to get fully general URI parsing to work correctly, but the overlap with legacy URL parsing is large enough that many (most?) use cases in practice work just fine with the older RFC semantics.
>
> Yes. We can have an API which strictly conforms to the latest RFC by definition, but the problem is that there is code out there which 'expects' the parsing behavior to remain unchanged so that it does not break. And keeping the parsing behavior unchanged means conforming to the older RFC parsing rules.
>
> The solution seems to be an extra function, or a flag in the urlparse method, which will exhibit the newer behavior.
>
> Guido wrote:
>
>> So would having two different API functions, one legacy and one conforming, be a problem? Ideally the conforming API's name would not be something lame like urllib2 but something timeless. :-)
>
> :-) Should blame Jeremy for that name! But urllib2 has long been replaced by urllib.parse, urllib.request and urllib.response. Considering how you remember urllib2, I think its name has stood the test of time.

It stood out like a sore thumb. :-)

> But seriously, I think an additional function, or an additional flag in the current functions/methods in the parse module, is sufficient, rather than going for another module.

I vote for a new function, not a flag. (Others can explain my rule of thumb against flag arguments whose values are nearly always constants.)

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
Nick Coghlan wrote:
> > The problem is that it is quite a lot of work to get fully general URI parsing to work correctly, but the overlap with legacy URL parsing is large enough that many (most?) use cases in practice work just fine with the older RFC semantics.

Yes. We can have an API which strictly conforms to the latest RFC by definition, but the problem is that there is code out there which 'expects' the parsing behavior to remain unchanged so that it does not break. And keeping the parsing behavior unchanged means conforming to the older RFC parsing rules.

The solution seems to be an extra function, or a flag in the urlparse method, which will exhibit the newer behavior.

Guido wrote:

> So would having two different API functions, one legacy and one conforming, be a problem? Ideally the conforming API's name would not be something lame like urllib2 but something timeless. :-)

:-) Should blame Jeremy for that name! But urllib2 has long been replaced by urllib.parse, urllib.request and urllib.response. Considering how you remember urllib2, I think its name has stood the test of time.

But seriously, I think an additional function, or an additional flag in the current functions/methods in the parse module, is sufficient, rather than going for another module.

--
Senthil
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Wed, Mar 16, 2011 at 5:02 AM, Nick Coghlan wrote:
> On Tue, Mar 15, 2011 at 11:34 PM, Guido van Rossum wrote:
>>
>> Can you be specific? What is different between those RFCs?
>
> I finally got around to trying to backport some of the additional urljoin tests from http://bugs.python.org/issue1500504 (specifically, the additional ones Mike Brown provided), but got tripped up by the behavioural changes between the earlier RFCs and RFC 3986 regarding the way ".." is handled.

Ah, got it.

> Even in test_urlparse, a bunch of the normative tests from RFC 3986 are commented out because they fail (by design) when run through urllib.parse.urljoin. Some of the additional tests also fail because our urljoin implementation has a whitelist of schemes that support relative references, whereas 3986 expects relative references to work for unknown schemes as well.
>
> There are actually quite a few more terminology changes as well (as Senthil pointed out in his email), but it was specifically the failing test cases for urljoin semantics that bit me again yesterday.
>
> The problem is that it is quite a lot of work to get fully general URI parsing to work correctly, but the overlap with legacy URL parsing is large enough that many (most?) use cases in practice work just fine with the older RFC semantics.

So would having two different API functions, one legacy and one conforming, be a problem? Ideally the conforming API's name would not be something lame like urllib2 but something timeless. :-)

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Tue, Mar 15, 2011 at 11:34 PM, Guido van Rossum wrote:
>
> Can you be specific? What is different between those RFCs?

I finally got around to trying to backport some of the additional urljoin tests from http://bugs.python.org/issue1500504 (specifically, the additional ones Mike Brown provided), but got tripped up by the behavioural changes between the earlier RFCs and RFC 3986 regarding the way ".." is handled.

Even in test_urlparse, a bunch of the normative tests from RFC 3986 are commented out because they fail (by design) when run through urllib.parse.urljoin. Some of the additional tests also fail because our urljoin implementation has a whitelist of schemes that support relative references, whereas 3986 expects relative references to work for unknown schemes as well.

There are actually quite a few more terminology changes as well (as Senthil pointed out in his email), but it was specifically the failing test cases for urljoin semantics that bit me again yesterday.

The problem is that it is quite a lot of work to get fully general URI parsing to work correctly, but the overlap with legacy URL parsing is large enough that many (most?) use cases in practice work just fine with the older RFC semantics.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
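The scheme whitelist mentioned above can be probed directly. The short sketch below assumes CPython's urllib.parse, where the list is exposed as the module-level name uses_relative. That name is an undocumented implementation detail, and the example URLs are made up for illustration. No expected output is given for the unknown-scheme join, since the result depends on the Python version in use.

```python
from urllib.parse import urljoin, uses_relative

# Hypothetical base URLs: one with a whitelisted scheme, one without.
base_known = "http://example.com/a/b"
base_unknown = "foo://example.com/a/b"

# The whitelist is a plain list of scheme names.
known_is_listed = "http" in uses_relative
unknown_is_listed = "foo" in uses_relative

# A relative reference resolved against each base.
joined_known = urljoin(base_known, "c")
joined_unknown = urljoin(base_unknown, "c")

print("http whitelisted:", known_is_listed)
print("foo whitelisted: ", unknown_is_listed)
print("known scheme:    ", joined_known)
print("unknown scheme:  ", joined_unknown)
```

Under RFC 3986 both joins would be expected to produce a URI ending in "/a/c". Comparing the last two printed lines shows whether the interpreter at hand treats the unknown scheme differently.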
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Wed, Mar 16, 2011 at 12:03 AM, Senthil Kumaran wrote:
> A new function which provides this behavior is also a good idea.

I'm strongly in favor of this approach. I know we've been bitten by changes made in the past, and have had to introduce Python-version specific handling. (I don't have the details handy, but vaguely recall the two versions involved being 2.4 and 2.6.)

-Fred

--
Fred L. Drake, Jr.
"A storm broke loose in my mind." --Albert Einstein
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
Nick Coghlan wrote:
> > Backwards compatible with *what* though?

I meant the parsing 'behavior'.

> For the decimal module, we treat deviations from spec as bug fixes and update accordingly, even if this changes behaviour.
>
> For URL parsing, the spec has changed (6 years ago!), but we still don't provide a spec-conformant implementation, even via a flag or new function.

If I understand correctly, by spec-conformant implementation you mean having the parsed components denoted by the same terminology (as well as behavior) as written in RFC 3986, like the example in the RFC:

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose

If I pass the same URLs through urlparse at the moment, I get:

>>> urlparse('foo://example.com:8042/over/there?name=ferret#nose')
ParseResult(scheme='foo', netloc='example.com:8042', path='/over/there?name=ferret#nose', params='', query='', fragment='')
>>> urlparse('urn:example:animal:ferret:nose')
ParseResult(scheme='urn', netloc='', path='example:animal:ferret:nose', params='', query='', fragment='')

The first result is because we still have the "old" scheme-specific parsing behavior: foo is an unrecognized scheme, so everything after the authority is classified under path. With a recognized scheme name, the parsing behaviour would match the expectation.
- A change to this would break compatibility with the older parsing behavior.

Another point to note is naming. We use 'netloc' loosely as the part name, whereas 'authority' is the correct term, and the authority component then has sub-parts.
- I think it would be good to change this and adopt the RFC terminology more rigorously.

I am +1 on any helpful improvement we can make to this module. But it has often been noticed that even the slightest change in parsing behavior causes harm and brings us more bug reports. A new function which provides this behavior is also a good idea.

--
Senthil
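To illustrate the naming point, here is a small sketch that re-labels the result of urllib.parse.urlsplit using the RFC 3986 component names and then breaks the authority into its sub-parts. The names UriParts, Authority, split_uri and split_authority are hypothetical and not part of the standard library. Recent Python 3 releases split the query and fragment regardless of scheme, so the result for the foo: example differs from the 2011 output quoted above.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical result types using RFC 3986 terminology.
UriParts = namedtuple("UriParts", "scheme authority path query fragment")
Authority = namedtuple("Authority", "userinfo host port")


def split_uri(uri):
    """Split a URI with urlsplit, exposing 'netloc' as 'authority'."""
    parts = urlsplit(uri)
    return UriParts(parts.scheme, parts.netloc, parts.path,
                    parts.query, parts.fragment)


def split_authority(authority):
    """Split an authority string into userinfo, host and port.

    Missing sub-parts are returned as None. The port is kept as a string
    and is not validated.
    """
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    host, colon, port = hostport.rpartition(":")
    if not colon or "]" in port:
        # No port present, or the last colon was inside an IPv6 literal.
        host, port = hostport, None
    return Authority(userinfo, host, port)


example = split_uri("foo://example.com:8042/over/there?name=ferret#nose")
print(example)
print(split_authority(example.authority))
print(split_uri("urn:example:animal:ferret:nose"))
```

This sketch only renames fields. It inherits whatever parsing rules the installed urlsplit applies, so it is not a conformance layer.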
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Tue, Mar 15, 2011 at 7:58 PM, Nick Coghlan wrote:
> On Tue, Mar 15, 2011 at 7:14 PM, Senthil Kumaran wrote:
>> On Wed, Mar 16, 2011 at 7:01 AM, Nick Coghlan wrote:
>>> With RFC 3986 passing its 6th birthday, and with it being well past its 7th by the time 3.3 comes out, can we finally switch to supporting the current semantics rather than the obsolete behaviour?
>>
>> We do in fact support RFC 3986, except for the cases where it conflicts with the previous RFCs (IOW, backwards compatible). The tests can give you a good picture here. Do you mean we should just do away with backwards compatibility? Or did you have anything else specifically in mind?
>
> Backwards compatible with *what* though?
>
> For the decimal module, we treat deviations from spec as bug fixes and update accordingly, even if this changes behaviour.
>
> For URL parsing, the spec has changed (6 years ago!), but we still don't provide a spec-conformant implementation, even via a flag or new function.

Can you be specific? What is different between those RFCs?

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Tue, Mar 15, 2011 at 7:14 PM, Senthil Kumaran wrote:
> On Wed, Mar 16, 2011 at 7:01 AM, Nick Coghlan wrote:
>> With RFC 3986 passing its 6th birthday, and with it being well past its 7th by the time 3.3 comes out, can we finally switch to supporting the current semantics rather than the obsolete behaviour?
>
> We do in fact support RFC 3986, except for the cases where it conflicts with the previous RFCs (IOW, backwards compatible). The tests can give you a good picture here. Do you mean we should just do away with backwards compatibility? Or did you have anything else specifically in mind?

Backwards compatible with *what* though?

For the decimal module, we treat deviations from spec as bug fixes and update accordingly, even if this changes behaviour.

For URL parsing, the spec has changed (6 years ago!), but we still don't provide a spec-conformant implementation, even via a flag or new function.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
On Wed, Mar 16, 2011 at 7:01 AM, Nick Coghlan wrote:
> With RFC 3986 passing its 6th birthday, and with it being well past its 7th by the time 3.3 comes out, can we finally switch to supporting the current semantics rather than the obsolete behaviour?

We do in fact support RFC 3986, except for the cases where it conflicts with the previous RFCs (IOW, backwards compatible). The tests can give you a good picture here. Do you mean we should just do away with backwards compatibility? Or did you have anything else specifically in mind?

--
Senthil
[Python-Dev] Finally switch urllib.parse to RFC3986 semantics?
For years, urlparse (and subsequently urllib.parse) has opted to implement the semantics from the older URL processing RFCs, rather than updating to the new semantics as the RFCs are superseded.

With RFC 3986 passing its 6th birthday, and with it being well past its 7th by the time 3.3 comes out, can we finally switch to supporting the current semantics rather than the obsolete behaviour?

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia