Re: [Python-Dev] Fwd: RFC - Google SoC proposal - cleanup urllib
Senthil Kumaran wrote:
> I have written a proposal to cleanup urllib as part of Google SoC. I am
> attaching the file 'soc1' with this email. Requesting you to go through the
> proposal and provide any feedback which I can incorporate in my submission.

From your proposal:

> 2) In all modules, Follow the new RFC 2396 in favour of RFC 1738 and RFC 1808.
> [...]
> In all modules, follow the new RFC 2396 in favor of RFC 1738, RFC 1808. The
> standards for URI described in RFC 2396 is different from older RFCs and
> urllib, urllib2 modules implement the URL specifications based on the older
> URL specification. This will need changes in urlparse and other parse modules
> to handle URLs as specified in the RFC 2396.

The new RFC 2396 was superseded by STD 66 (RFC 3986) two years ago. Your
failure to notice this development doesn't bode well :) j/k, although it does
undermine confidence somewhat. I think the bugfixes sound great, but major
enhancements and API refactorings need to be undertaken more cautiously. In
any case, I have a few suggestions:

- Read http://en.wikipedia.org/wiki/Uniform_Resource_Identifier. (I wrote the
  majority of it, and got peer review from the URI WG a while back.)
- Read http://en.wikipedia.org/wiki/Percent_encoding. (I wrote most of this
  too.)
- Familiarize yourself with STD 66 (i.e., don't trust anything I wrote ;)).
  Especially note its differences from RFC 2396 (summarized in an appendix).
- Seek peer review for any changes that you attribute to changing standards.
  In my experience implementing a general-purpose URI processing library
  (http://cvs.4suite.org/viewcvs/4Suite/Ft/Lib/Uri.py?view=markup), there were
  times when I thought the standard was saying a bit more than it really was,
  especially when it came to percent-encoding, which has several
  somewhat-conflicting conventions and standards governing it. I tried to
  cover these in the Wikipedia article.
- Anticipate real-world use cases.
If you go down the road of doing what the standards recommend (be aware of
"should" vs. "must", and whether it's directed at URI producers or consumers),
you might lose sight of the fact that there's a reason, for example, that
people use encodings other than the recommended UTF-8 as the basis for
percent-encoding. Similarly, expectations surrounding the behavior of 'file'
URIs and the path portions thereof are sometimes less than optimal in the real
world. If you're designing an API, be flexible, and seek review for any
incompatibilities you intend to introduce.

- Be aware of the fact that people might have different expectations when they
  use different string types (unicode, str) in URI processing, and different
  levels of awareness of the levels of abstraction at which URI processing
  operates. It can be difficult to uniformly handle unicode and str. And then
  there's IRIs (RFC 3987)...

For additional background, you might also check the python-dev discussion of
urllib in Sep 2004, urlparse in Nov 2005, and the competing uriparse.py
proposals (Apr, Jun 2006).

Mike

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] urllib.quote and unicode bug resuscitation attempt
Stefan Rank wrote:
> Well, originally, I would have expected it to return a byte str(ing),

I'd assume unicode in, unicode out, and str in, str out, but since it's always
going to produce ASCII-range characters, it wouldn't matter. Moot point anyway.

> BUT I am now converted and think it is best to raise a TypeError for
> unicode, and leave the encoding decisions to higher level code. So I'll
> repeat the patch #1, slightly modified::
>
>     if isinstance(s, unicode):
>         raise TypeError("quote expects an encoded byte string as argument")
>
> Is it safe to assume that code that breaks because of this change was
> already broken?

Yes. The patch seems fine to me, although maybe "if not isinstance(s, str)"
would be better?
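For what it's worth, the line the patch draws is exactly the one the later
urllib.parse API in Python 3 adopted: quote_from_bytes() accepts only encoded
bytes and raises TypeError for character strings (shown here with the modern
API for illustration, not the urllib of this thread):

```python
from urllib.parse import quote_from_bytes

# Encoded bytes are quoted as usual...
assert quote_from_bytes(b'abc 123%') == 'abc%20123%25'

# ...but a character string is rejected outright, leaving the encoding
# decision to higher-level code, as the patch intends.
try:
    quote_from_bytes('abc 123%')
except TypeError:
    pass
```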
Re: [Python-Dev] urllib.quote and unicode bug resuscitation attempt
Stefan Rank wrote:
> on 12.07.2006 07:53 Martin v. Löwis said the following:
>> Anthony Baxter wrote:
>>> The right thing to do is IRIs. For 2.5, should we at least detect that
>>> it's unicode and raise a useful error?
>> That can certainly be done, sure.
>> Martin
> That would be great. And I agree that updating urllib.quote for unicode
> should be part of a grand plan that updates all of urllib[2] and introduces
> an irilib / urischemes / uriparse module in 2.6 as Martin and John J Lee
> suggested. =)
> cheers, stefan

Put me down as +1 on raising a useful error instead of a KeyError or whatever,
and +1 on having an irilib, but -1 on working toward accepting unicode in the
URI-oriented urllib.quote(), because (a) user expectations for strings that
contain non-ASCII-range characters will vary, and (b) percent-encoding is
supposed to only operate on a byte-encoded view of non-URI information, not
the information itself.

Longer explanation: I, too, initially thought that quote() was outdated since
it choked on unicode input, but eventually I came to realize that it's wise to
reject such input, because to attempt to percent-encode characters, rather
than bytes, reflects a fundamental misunderstanding of the level at which
percent-encoding is intended to operate. This is one of the hardest aspects of
URI processing to grok, and I'm not very good at explaining it, even though
I've tried my best in the Wikipedia articles. It's basically these 3 points:

1. A URI can only consist of 'unreserved' characters, as I'm sure you know.
It's a specific set that has varied slightly over the years, and is a subset
of printable ASCII.

2. A URI scheme is essentially a mapping of non-URI information to a sequence
of URI characters. That is, it is a method of producing a URI from non-URI
information within a particular information domain... and vice versa.

3. A URI scheme should (though may not do so very clearly, especially the
older it is!)
tell you that the way to represent a particular bit of non-URI information,
'info', in a URI is to convert_to_bytes(info), and then, as per STD 66, make
the bytes that correspond, in ASCII, to unreserved characters manifest as
those characters, and all others manifest as their percent-encoded
equivalents. In urllib parlance, this step is 'quoting' the bytes.

3.1. [This isn't crucial to my argument, but has to be mentioned to complete
the explanation of percent-encoding.] In addition, those bytes corresponding,
in ASCII, to some 'reserved' characters are exempt from needing to be
percent-encoded, so long as they're not being used for their reserved purpose
(if any) in whatever URI component they're going into. Semantically, there's
no difference between such bytes when expressed in the URI as a literal
reserved character or as a percent-encoded byte. URI scheme specs vary greatly
in how they deal with this nuance. In any case, urllib.quote() has the 'safe'
argument which can be used to specify the exempt reserved characters.

In the days when the specs that urllib was based on were relevant, 99% of the
time, the bytes being 'quoted' were ASCII-encoded strings representing ASCII
character-based non-URI information, so quite a few of us, including many URI
scheme authors, were tempted to think that what was being
'quoted'/percent-encoded *was* the original non-URI information, rather than a
bytewise view of it mandated by a URI scheme. That's what I was doing when I
thought that quote(some_unicode_path) should 'work', especially in light of
Python's "treat all strings alike" guideline.

But if you accept all of the above, which is what I believe the standard
requires, then unicode input is a very different situation from str input;
it's unclear whether and how the caller wants the input to be converted to
bytes, or if they even understand what they're doing at all.

See, right now, quote('abc 123%') returns 'abc%20123%25', as you would expect.
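That 'quoting the bytes' step is small enough to sketch in full. Here is a
minimal percent-encoder in the spirit of STD 66 (my own illustration, not
urllib's actual code; the unreserved set shown is RFC 3986's):

```python
# Unreserved octets per RFC 3986 pass through as literal characters;
# every other octet is expressed as %XX.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~")

def pct_encode(data, safe=b""):
    """Percent-encode a bytes object. 'safe' plays the role of urllib.quote's
    'safe' argument: reserved octets exempted in the component at hand."""
    exempt = UNRESERVED | frozenset(safe)
    return "".join(chr(b) if b in exempt else "%%%02X" % b for b in data)

assert pct_encode(b'abc 123%') == 'abc%20123%25'
```

Note that there is no unicode anywhere in sight: the function operates
strictly on a bytewise view of the information, so it cannot even be asked the
ambiguous question.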
Similarly, everyone would probably expect u'abc 123%' to return
u'abc%20123%25', and if we were to implement that, there'd probably be no harm
done.

But look at quote('\xb7'), which, assuming you accept everything I've said
above is correct, rightfully returns '%B7'. What would someone expect
quote(u'\xb7') to return?

- Some might want u'%B7' because they want the same result type as the input
  they gave, with no other changes from how it would normally be handled.
- Some might want u'%C2%B7' because they're conflating the levels of
  abstraction and expect, say, UTF-8 conversion to be done on their input.
- Some (like me) might want a TypeError or ValueError because we shouldn't be
  handing such ambiguous data to quote() in the first place.

And then there's the u'\u0100'-and-up input to worry about; what does a user
expect to be done with that?

I would prefer to see quote() always reject unicode input with a TypeError.
Alternatively, if it accepts unicode, it should produce unicode, and since it
can only reasonably assume what
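As a postscript: this is essentially how Python 3's urllib.parse eventually
settled the question. Character-string input is accepted only together with an
explicit (or defaulted) encoding that names the bytewise view, so both of the
expectations above are expressible, but the caller must choose (modern API
shown for illustration, not the urllib of this thread):

```python
from urllib.parse import quote

# "same result as str input, no conversion": treat the text as Latin-1 octets
assert quote('\xb7', encoding='latin-1') == '%B7'

# "convert to UTF-8 first": the Python 3 default
assert quote('\xb7') == '%C2%B7'
```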
Re: [Python-Dev] UUID module
Fredrik Lundh wrote:
> Ka-Ping Yee wrote:
>> Quite a few people have expressed interest in having UUID functionality in
>> the standard library, and previously on this list some suggested possibly
>> using the uuid.py module i wrote: http://zesty.ca/python/uuid.py
> +1!

+1 as well. I have a couple of suggestions for improving that implementation:

1. You're currently using os.urandom, which can raise a NotImplementedError.
You should be prepared to fall back on a different PRNG... which leads to the
2nd suggestion:

2. random.randrange is a method on a default random.Random instance that,
although seeded by urandom (if available), may not be the user's preferred
PRNG. I recommend making it possible for the user to supply their own
random.Random instance for use by the module.

That's all. :)
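Suggestion 2 could look something like this (a hypothetical helper, not part
of uuid.py; the UUID constructor takes care of the version-4 version and
variant bits):

```python
import random
import uuid

def uuid4_from(rng):
    """Build a random (version 4) UUID from a caller-supplied random.Random
    instance, instead of the module's default PRNG."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)

# Callers pick their PRNG: SystemRandom, or a seeded Random for testing.
u = uuid4_from(random.SystemRandom())
```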
Re: [Python-Dev] Some more comments re new uriparse module, patch 1462525
John J Lee wrote:
> http://python.org/sf/1500504
> [...]
> At first glance, looks good. I hope to review it properly later. One point:
> I don't think there should be any mention of URL in the module -- we should
> use URI everywhere (see my comments on Paul's original version for a bit
> more on this).

Agreed.

Although you've added the test cases from 4Suite and credited me for them,
only a few of the test cases were invented by me. I'd rather you credited them
to their original sources, as I did. Also, I believe Graham Klyne has been
adding some new cases to his Haskell tools, but hasn't been updating the other
spreadsheet and RDF files in which he publishes them in a more usable form. My
tests only use what's in the spreadsheet, so I've only got 88 out of 99
testRelative cases from
http://cvs.haskell.org/cgi-bin/cvsweb.cgi/fptools/libraries/network/tests/URITest.hs
So if you really want to be thorough, grab the missing cases from there.

- It appears that Paul uploaded a new version of his library on June 3:
  http://python.org/sf/1462525
  I'm unclear on the relationship between the two now. Are they both up for
  consideration?

- One thing I forgot to mention in private email is that I'm concerned that
  the inclusion of URI reference resolution functionality has exceeded the
  scope of this 'urischemes' module that purports to be for 'extensible URI
  parsing'. It is becoming a scheme-aware and general-purpose syntactic
  processing library for URIs, and should be characterized as such in its name
  as well as in its documentation. Even without a new name and more accurately
  documented scope, people are going to see no reason not to add the rest of
  STD 66's functionality to it (percent-encoding, normalization for testing
  equivalence, syntax validation...). As you can see in Ft.Lib.Uri, the latter
  two are not at all hard to implement, especially if you use regular
  expressions. These all fall under syntactic operations on URIs, just like
  reference-resolution.
Percent-encoding gets very hairy with its API details due to application-level
uses that don't jibe with STD 66 (e.g. the fuzzy specs and convoluted history
governing application/x-www-form-urlencoded), the nuances of character
encoding and Python string types, and the widely varying expectations of
users.
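For the reference-resolution functionality discussed above, RFC 3986's own
worked examples make handy test cases. Here are a few of them run against the
modern stdlib resolver (urllib.parse.urljoin, which postdates this thread):

```python
from urllib.parse import urljoin

base = 'http://a/b/c/d;p?q'   # the base URI used in RFC 3986 section 5.4
assert urljoin(base, 'g') == 'http://a/b/c/g'
assert urljoin(base, './g') == 'http://a/b/c/g'
assert urljoin(base, '../g') == 'http://a/b/g'
assert urljoin(base, 'g?y') == 'http://a/b/c/g?y'
```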
Re: [Python-Dev] New Pythondoc by effbot
BJ wrote:
> Why does it have to be wiki-like? Why can't it be a wiki? MediaWiki seems to
> work pretty well for a lot of software projects that have put their
> documentation in a wiki. Talk pages for commentary and primary pages for
> reviewed content.

And inconsistent formatting from article to article, limitations in indexing
options, no way to browse a set of documentation specific to a particular
Python release...

I personally like the PHP docs - http://www.php.net/manual/en/
They're not versioned, and users can't modify the indexes or API docs, but
they do get to add annotations. Annotations that reflect errors or major
omissions from the docs can be reviewed by editors and folded into the docs as
needed. *shrug*
[Python-Dev] ElementTree in stdlib
Catching up on some python-dev email, I was surprised to see that things seem
to be barrelling ahead with the adding of ElementTree to Python core without
any discussion on XML-SIG. Sidestepping XML-SIG and the proving grounds of
PyXML in order to satisfy the demand for a Pythonic databinding+API for XML in
stdlib seems to be a bit of a raised middle finger to those folks who have
worked hard on competing or differently-scoped APIs, each of which deserves a
bit more peer review than just a single nomination on python-dev, which seems
to be all it took to obtain a blessing for ElementTree. I have nothing against
ElementTree, and would like to see more XML processing options in core, but it
seems to me like the XML-SIG is being deliberately left out of this process.

Just last month, Guido submitted to XML-SIG a Pythonic XML API that he had
been tinkering with.[1] I don't think anyone was really bold enough to tell
him what they really thought of it (other than that it is a lot like XIST),
but it was admirable that he put it up for peer review rather than just
dropping it into stdlib. Perhaps more importantly, it prompted some discussion
that more or less acknowledged that these kinds of APIs do seem to be the
future of XML in Python, and that we should be thinking about bringing some of
them into PyXML and, ultimately, stdlib. But the problem of how to choose from
the many options also became immediately apparent.[2] The discussion stalled,
but I think it should start up again, in the proper forum, rather than letting
the first-mentioned API supplant the dozen+ alternatives that could also be
considered as candidates.[3]

Sorry to be a sourpuss.
Mike

--
[1] http://mail.python.org/pipermail/xml-sig/2005-November/011248.html
    (Guido's very civil proposal and request for peer review)
[2] http://mail.python.org/pipermail/xml-sig/2005-November/011252.html
    (this also summarizes the categories of software/approaches that people
    are taking to the general problem of working with XML Pythonically)
[3] http://www.xml.com/pub/a/2004/10/13/py-xml.html
    (and there are at least 3 more databinding APIs that have come out since
    then)
Re: [Python-Dev] ElementTree in stdlib
Martin v. Löwis wrote:
> So as that has more-or-less failed, the next natural approach is "let's
> believe in the community". For that, two things need to happen: the author
> of the package must indicate that he would like to see it incorporated, and
> the users must indicate that they like the package. Both has happened for
> ElementTree, but I think it could happen for other packages, as well. If it
> is merely the lack of due process you are complaining about, and you agree
> with the result, then IMO nothing would need to be changed about the result.
> Discussing it post-factum on xml-sig might still be valuable.

Thanks Martin and others for responding. I fully agree that ElementTree has
proven to be useful, popular, and stable, and probably no one would object to
ElementTree being given the endorsement that is implicit in its being made a
part of stdlib. The lack of due process, given that XML-SIG seems to exist
largely to provide that very service for all things XML in Python, is indeed
all I'm complaining about. I am happy that for once, there is momentum behind
this sort of thing, and more power to you for that.

My fears are just that:

1. XML-SIG is being seen as either irrelevant or as an obstacle (perhaps due
to the friction between Fredrik and Uche) and is thus being sidestepped, and

2. other libs that could/should be contenders (Amara and 4Suite are not in
this list, by the way) are going to become further marginalized by virtue of
the fact that people will say "well, we have ElementTree in stdlib already,
why do we need (fill in the blank)?"

I suppose the same kind of implicit endorsements were given to minidom and
SAX, and that obviously hasn't prevented people from going out and using
ElementTree, lxml, etc., so I don't know... I can't predict the future. I'd
just feel better about it if everyone on XML-SIG, where people hang out
because they have a definite interest in this kind of thing, knew what was
going on.
Some authors of other libs may not even be aware that they could so easily
have their code whisked into stdlib, if it's solid enough.

Mike
Re: [Python-Dev] urlparse brokenness
Paul Jimenez wrote:
> So I propose that urlsplit, the main offender, be replaced with something
> that looks like:
>
>     def urlsplit(url, scheme='', allow_fragments=1,
>                  default=('','','','','')):

+1 in principle. You should probably make _parse_cache a module-level global,
and add 'is not None' after 'if cached'.
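A sketch of what those two tweaks look like together (illustrative only,
written as a memoizing wrapper over a stand-in parser rather than Paul's
actual patch):

```python
from urllib.parse import urlsplit as _do_split   # stand-in for the real parser

_parse_cache = {}   # module-level global, per the suggestion above

def urlsplit(url, scheme='', allow_fragments=True):
    key = (url, scheme, allow_fragments)
    cached = _parse_cache.get(key)
    # 'is not None' rather than truth-testing, so a falsy-but-valid
    # cached result still counts as a hit
    if cached is not None:
        return cached
    result = _do_split(url, scheme, allow_fragments)
    _parse_cache[key] = result
    return result
```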
Re: [Python-Dev] bug in urlparse
[EMAIL PROTECTED] wrote:
> According to RFC 2396[1] section 5.2:

RFC 2396 is obsolete. It was superseded by RFC 3986 / STD 66 early this year.
In particular, the procedure for removing dot-segments from the path component
of a URI reference -- a procedure that is only supposed to be done when
'resolving' a reference to absolute form (i.e., merging it with a base URI,
which, being a URI, not a URI reference, is not allowed to contain
dot-segments) -- has received a significant overhaul.

The implementation guidance you quoted from RFC 2396 is no longer relevant.
Technically, it never was relevant, since urlparse only claims to implement
RFC 1808 (2396's predecessor, now ten years old). The new procedure says:

> ...dot-segments are intended for use in URI references to express an
> identifier relative to the hierarchy of names in the base URI. The
> remove_dot_segments algorithm respects that hierarchy by removing extra
> dot-segments rather than treat them as an error or leaving them to be
> misinterpreted by dereference implementations.

-Mike
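The overhauled procedure (remove_dot_segments, RFC 3986 section 5.2.4)
transcribes almost directly into Python. This is a sketch from the spec's
pseudocode, not urlparse's code:

```python
def remove_dot_segments(path):
    """Remove '.' and '..' segments per RFC 3986 section 5.2.4."""
    output = []   # completed path segments, each with its leading '/'
    while path:
        if path.startswith('../'):          # rule A
            path = path[3:]
        elif path.startswith('./'):         # rule A
            path = path[2:]
        elif path.startswith('/./'):        # rule B
            path = '/' + path[3:]
        elif path == '/.':                  # rule B
            path = '/'
        elif path.startswith('/../'):       # rule C: also pop a segment
            path = '/' + path[4:]
            if output:
                output.pop()
        elif path == '/..':                 # rule C
            path = '/'
            if output:
                output.pop()
        elif path in ('.', '..'):           # rule D
            path = ''
        else:                               # rule E: move first segment over
            cut = path.find('/', 1)
            if cut == -1:
                output.append(path)
                path = ''
            else:
                output.append(path[:cut])
                path = path[cut:]
    return ''.join(output)

# The spec's own examples:
assert remove_dot_segments('/a/b/c/./../../g') == '/a/g'
assert remove_dot_segments('mid/content=5/../6') == 'mid/6'
```

Note how extra '..' segments are silently dropped at the root rather than
treated as an error, exactly as the quoted rationale describes.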
Re: [Python-Dev] string find(substring) vs. substring in string
Fredrik Lundh wrote:
> any special reason why "in" is faster if the substring is found, but a lot
> slower if it's not in there?

Just guessing here, but in general I would think that it would stop searching
as soon as it found it, whereas until then, it keeps looking, which takes more
time. But I would also hope that it would be smart enough to know that it
doesn't need to look past the 2nd character in 'not the xyz' when it is
searching for 'not there' (due to the lengths of the sequences).
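The length-based cutoff is the easy half of that hope: a match can never start
past len(haystack) - len(needle). A naive sketch of the idea (nothing like
CPython's actual C implementation, which is considerably more clever):

```python
def naive_contains(haystack, needle):
    n, m = len(haystack), len(needle)
    # A match can't start past n - m: the remaining text would be shorter
    # than the needle. For 'not there' (9 chars) in 'not the xyz' (11 chars),
    # only start positions 0..2 are ever tried before giving up.
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return True
    return False
```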
Re: [Python-Dev] mimetypes and _winreg
Following up on this 12 Jun 2004 post...

Garth wrote:
> Thomas Heller wrote:
>> Mike Brown [EMAIL PROTECTED] writes:
>>> I thought it would be nice to try to improve the mimetypes module by
>>> having it, on Windows, query the Registry to get the mapping of filename
>>> extensions to media types, since the mimetypes code currently just
>>> blindly checks posix-specific paths for httpd-style mapping files.
>>> However, it seems that the way to get mappings from the Windows registry
>>> is excessively slow in Python. I'm told that the reason has to do with
>>> the limited subset of APIs that are exposed in the _winreg module. I
>>> think it is that EnumKey(key, index) is querying for the entire list of
>>> subkeys for the given key every time you call it. Or something. Whatever
>>> the situation is, the code I tried below is way slower than I think it
>>> ought to be. Does anyone have any suggestions (besides "write it in C")?
>>> Could _winreg possibly be improved to provide an iterator or better
>>> interface to get the subkeys? (or certain ones? There are a lot of keys
>>> under HKEY_CLASSES_ROOT, and I only need the ones that start with a
>>> period.)
>> See this post I made some time ago:
>> http://mail.python.org/pipermail/python-dev/2004-January/042198.html
>>> Should I file this as a feature request?
>> If you still think it should be changed in the core, you should work on a
>> patch.
> I could file a patch if no one else is looking at it. The solution would be
> to use RegEnumKeyEx and remove RegQueryInfoKey. This loses compatibility
> with Win16, which I guess is OK.
> Garth

I would say it looks like no one else was looking at it, and Garth apparently
didn't submit a patch. It's beyond my means to come up with a patch myself.
Would someone be willing to take a look at it? Sorry, but I really want access
to registry subkeys to stop being so dog-slow.
:)

Thanks for taking a look,
-Mike