Re: [Python-Dev] Fwd: RFC - GoogleSOC proposal -cleanupurllib

2007-03-24 Thread Mike Brown
Senthil Kumaran wrote:
 I have written a proposal to cleanup urllib as part of Google SoC. I am
 attaching the file 'soc1' with this email. Requesting you to go through the
 proposal and provide any feedback which I can incorporate in my submission.

From your proposal:

 2) In all modules, Follow the new RFC 2396 in favour of RFC 1738 and RFC 1808.
 [...]
 In all modules, follow the new RFC 2396 in favor of RFC 1738, RFC 1808. The
 standards for URI described in RFC 2396 is different from older RFCs and
 urllib, urllib2 modules implement the URL specifications based on the older
 URL specification. This will need changes in urlparse and other parse 
 modules to handle URLS as specified in the RFC2396.

The "new" RFC 2396 was superseded by STD 66 (RFC 3986) two years ago. Your
failure to notice this development doesn't bode well :) j/k, although it does
undermine confidence somewhat.

I think the bugfixes sound great, but major enhancements and API refactorings
need to be undertaken more cautiously.

In any case, I have a few suggestions:

- Read http://en.wikipedia.org/wiki/Uniform_Resource_Identifier.
  (I wrote the majority of it, and got peer review from the URI WG a while
back).

- Read http://en.wikipedia.org/wiki/Percent_encoding.
  (I wrote most of this too).

- Familiarize yourself with STD 66. (i.e., don't trust anything I wrote ;))
  Especially note its differences from RFC 2396 (summarized in an appendix).

- Seek peer review for any changes that you attribute to changing standards.

In my experience implementing a general-purpose URI processing library
(http://cvs.4suite.org/viewcvs/4Suite/Ft/Lib/Uri.py?view=markup ),
there were times when I thought the standard was saying a bit more than it
really was, especially when it came to percent-encoding, which has several
somewhat-conflicting conventions and standards governing it. I tried to
cover these in the Wikipedia article.

- Anticipate real-world use cases. If you go down the road of doing what
the standards recommend (be aware of "should" vs "must" and whether
it's directed at URI producers or consumers), you might lose sight of the
fact that there's a reason, for example, people use encodings other than
the recommended UTF-8 as the basis for percent-encoding. Similarly,
expectations surrounding the behavior of 'file' URIs and path-portions
thereof are sometimes less than optimal in the real world. If you're
designing an API, be flexible, and seek review for any incompatibilities
you intend to introduce.

- Be aware of the fact that people might have different expectations when
they use different string types (unicode, str) in URI processing, and
different levels of awareness of the levels of abstraction at which URI
processing operates. It can be difficult to uniformly handle unicode and str.
And then there's IRIs (RFC 3987)...

For additional background, you might also check the python-dev discussion
of urllib in Sep 2004, urlparse in Nov 2005, and the competing uriparse.py
proposals (Apr, Jun 2006).

Mike

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] urllib.quote and unicode bug resuscitation attempt

2006-07-15 Thread Mike Brown
Stefan Rank wrote:
 Well, originally, I would have expected it to return a byte str(ing),

I'd assume unicode in, unicode out, and str in, str out, but since it's
always going to produce ASCII-range characters, it wouldn't matter.
Moot point anyway.

 BUT
 I am now converted and think it is best to raise a TypeError for 
 unicode, and leave the encoding decisions to higher level code.
 
 So I'll repeat the patch #1, slightly modified::
 
   if isinstance(s, unicode):
       raise TypeError("quote expects an encoded byte string as argument")
 
 Is it safe to assume that code that breaks because of this change was 
 already broken?

Yes. The patch seems fine to me, although maybe

if not isinstance(s, str):

would be better?
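
A minimal sketch of that stricter check, written for Python 3, where `bytes` plays the role that `str` played at the time of this thread and `str` is the unicode type. The wrapper name `strict_quote` is hypothetical; `urllib.parse.quote_from_bytes` is the standard-library call that percent-encodes a byte string.

```python
from urllib.parse import quote_from_bytes


def strict_quote(data, safe="/"):
    """Percent-encode a byte string; refuse anything that is not bytes.

    Hypothetical helper: encoding decisions are left to the caller,
    as suggested in the thread.
    """
    if not isinstance(data, bytes):
        raise TypeError("quote expects an encoded byte string as argument")
    return quote_from_bytes(data, safe=safe)
```

Callers who hold text would encode it themselves first, for example `strict_quote(text.encode("utf-8"))`, which makes the choice of encoding explicit.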


Re: [Python-Dev] urllib.quote and unicode bug resuscitation attempt

2006-07-13 Thread Mike Brown
Stefan Rank wrote:
 on 12.07.2006 07:53 Martin v. Löwis said the following:
  Anthony Baxter wrote:
  The right thing to do is IRIs. 
  For 2.5, should we at least detect that it's unicode and raise a 
  useful error?
  
  That can certainly be done, sure.
  
  Martin
 
 That would be great.
 
 And I agree that updating urllib.quote for unicode should be part of a 
 grand plan that updates all of urllib[2] and introduces an irilib / 
 urischemes / uriparse module in 2.6 as Martin and John J Lee suggested.
   =)
 
 cheers,
 stefan

Put me down as +1 on raising a useful error instead of a KeyError or whatever,
and +1 on having an irilib, but -1 on working toward accepting unicode in the
URI-oriented urllib.quote(), because (a.) user expectations for strings that
contain non-ASCII-range characters will vary, and (b.) percent-encoding is
supposed to only operate on a byte-encoded view of non-URI information, not
the information itself.

Longer explanation:

I, too, initially thought that quote() was outdated since it choked on unicode
input, but eventually I came to realize that it's wise to reject such input,
because to attempt to percent-encode characters, rather than bytes, reflects a
fundamental misunderstanding of the level at which percent-encoding is
intended to operate.

This is one of the hardest aspects of URI processing to grok, and I'm not
very good at explaining it, even though I've tried my best in the Wikipedia
articles. It's basically these 3 points:

1. A URI can only consist of 'unreserved' characters, as I'm sure you know. 
It's a specific set that has varied slightly over the years, and is a subset 
of printable ASCII.

2. A URI scheme is essentially a mapping of non-URI information to a sequence
of URI characters. That is, it is a method of producing a URI from non-URI
information within a particular information domain ...and vice-versa.

3. A URI scheme should (though may not do so very clearly, especially the
older it is!) tell you that the way to represent a particular bit of non-URI
information, 'info', in a URI is to convert_to_bytes(info), and then, as per
STD 66, make the bytes that correspond, in ASCII, to unreserved characters
manifest as those characters, and all others manifest as their percent-encoded
equivalents. In urllib parlance, this step is 'quoting' the bytes.

3.1. [This isn't crucial to my argument, but has to be mentioned to complete
the explanation of percent-encoding.] In addition, those bytes corresponding,
in ASCII, to some 'reserved' characters are exempt from needing to be
percent-encoded, so long as they're not being used for their reserved purpose
(if any) in whatever URI component they're going into -- Semantically, there's
no difference between such bytes when expressed in the URI as a literal
reserved character or as a percent-encoded byte. URI scheme specs vary greatly
in how they deal with this nuance. In any case, urllib.quote() has the 'safe' 
argument which can be used to specify the exempt reserved characters.
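
The byte-level step described in points 3 and 3.1 can be sketched as follows. This is an illustration in Python 3, not urllib's implementation: the names `UNRESERVED` and `percent_encode` are made up for the example, and the unreserved set used is the one from STD 66 (letters, digits, and `-._~`).

```python
# Bytes that correspond, in ASCII, to STD 66 'unreserved' characters.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(data, safe=b""):
    """Percent-encode a byte string (hypothetical helper).

    Unreserved bytes, plus any reserved bytes the caller lists in
    `safe`, appear as literal characters; every other byte becomes
    %XX with uppercase hex digits.
    """
    keep = UNRESERVED | frozenset(safe)
    return "".join(chr(b) if b in keep else "%%%02X" % b for b in data)


# The scheme-mandated conversion to bytes happens first, outside the helper.
path_info = "caf\u00e9/menu"
encoded = percent_encode(path_info.encode("utf-8"), safe=b"/")
```

The helper never sees the original text, only its byte-encoded view, which is the separation of levels the points above describe.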



In the days when the specs that urllib was based on were relevant, 99% of the
time, the bytes being 'quoted' were ASCII-encoded strings representing ASCII
character-based non-URI information, so quite a few of us, including many URI
scheme authors, were tempted to think that what was being
'quoted'/percent-encoded *was* the original non-URI information, rather than a
bytewise view of it mandated by a URI scheme.  That's what I was doing when I
thought that quote(some_unicode_path) should 'work', especially in light of
Python's "treat all strings alike" guideline.  But if you accept all of the
above, which is what I believe the standard requires, then unicode input is a
very different situation from str input; it's unclear whether and how the
caller wants the input to be converted to bytes, if they even understand what
they're doing at all.

See, right now, quote('abc 123%') returns 'abc%20123%25', as you would expect. 
Similarly, everyone would probably expect quote(u'abc 123%') to return
u'abc%20123%25', and if we were to implement that, there'd probably be no harm 
done.

But look at quote('\xb7'), which, assuming you accept everything I've said
above is correct, rightfully returns '%B7'.  What would someone expect
quote(u'\xb7') to return?  Some might want u'%B7' because they want the same
result type as the input they gave, with no other changes from how it would
normally be handled. Some might want u'%C2%B7' because they're conflating the
levels of abstraction and expect, say, UTF-8 conversion to be done on their
input.  Some (like me) might want a TypeError or ValueError because we
shouldn't be handing such ambiguous data to quote() in the first place. And 
then there's the u'\u0100'-and-up input to worry about; what does a user
expect to be done with that?
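
To make the ambiguity concrete, here is a small sketch using Python 3's `urllib.parse.quote`, which postdates this thread and resolves the question by taking an explicit `encoding` argument (UTF-8 by default). The variable names are only for illustration.

```python
from urllib.parse import quote

# The same one-character text yields different URI text depending on
# which byte encoding is applied before percent-encoding.
as_latin1 = quote("\u00b7", encoding="latin-1")  # '%B7'
as_utf8 = quote("\u00b7", encoding="utf-8")      # '%C2%B7'

# Characters from U+0100 upward have no Latin-1 byte form at all, so
# with the default strict error handling the conversion fails.
try:
    quote("\u0100", encoding="latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```

Neither result is "the" correct one; which bytes are wanted depends on the URI scheme and the application, which is the argument for not guessing.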

I would prefer to see quote() always reject unicode input with a TypeError. 
Alternatively, if it accepts unicode, it should produce unicode, and since it
can only reasonably assume what 

Re: [Python-Dev] UUID module

2006-06-10 Thread Mike Brown
Fredrik Lundh wrote:
 Ka-Ping Yee wrote:
 
  Quite a few people have expressed interest in having UUID
  functionality in the standard library, and previously on this
  list some suggested possibly using the uuid.py module i wrote:
  
  http://zesty.ca/python/uuid.py
 
 +1!

+1 as well.

I have a couple of suggestions for improving that implementation:

1. You're currently using os.urandom, which can raise a NotImplementedError. 
You should be prepared to fall back on a different PRNG... which leads to the
2nd suggestion:

2. random.randrange is a method on a default random.Random instance that,
although seeded by urandom (if available), may not be the user's preferred
PRNG.  I recommend making it possible for the user to supply their own
random.Random instance for use by the module.

That's all. :)
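
A rough sketch of both suggestions together, assuming Python 3. The helper names `random_bytes` and `make_uuid4` are hypothetical and are not part of Ka-Ping's module; `uuid.UUID(bytes=..., version=4)` is the standard-library constructor that sets the version and variant bits.

```python
import os
import random
import uuid


def random_bytes(n, rng=None):
    """Return n random bytes (hypothetical helper).

    If the caller supplies a random.Random instance, use it; otherwise
    try os.urandom and fall back to a fresh random.Random if the
    platform has no OS-level source.
    """
    if rng is not None:
        return rng.getrandbits(8 * n).to_bytes(n, "big")
    try:
        return os.urandom(n)
    except NotImplementedError:
        return random.Random().getrandbits(8 * n).to_bytes(n, "big")


def make_uuid4(rng=None):
    """Build a version-4 UUID from 16 random bytes."""
    return uuid.UUID(bytes=random_bytes(16, rng), version=4)
```

Passing a seeded generator, as in `make_uuid4(random.Random(42))`, gives reproducible values, which can be handy in tests.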


Re: [Python-Dev] Some more comments re new uriparse module, patch 1462525

2006-06-08 Thread Mike Brown
John J Lee wrote:
  http://python.org/sf/1500504
 [...]
 
 At first glance, looks good.  I hope to review it properly later.
 
 One point: I don't think there should be any mention of URL in the 
 module -- we should use URI everywhere (see my comments on Paul's 
 original version for a bit more on this).

Agreed.

Although you've added the test cases from 4Suite and credited me for them,
only a few of the test cases were invented by me.  I'd rather you credited
them to their original sources, as I did.

Also, I believe Graham Klyne has been adding some new cases to his Haskell
tools, but hasn't been updating the other spreadsheet and RDF files in which 
he publishes them in a more usable form. My tests only use what's in the 
spreadsheet, so I've only got 88 out of 99 testRelative cases from
http://cvs.haskell.org/cgi-bin/cvsweb.cgi/fptools/libraries/network/tests/URITest.hs
So if you really want to be thorough, grab the missing cases from there.

-

It appears that Paul uploaded a new version of his library on June 3:
http://python.org/sf/1462525
I'm unclear on the relationship between the two now. Are they both up for 
consideration?

-

One thing I forgot to mention in private email is that I'm concerned that the
inclusion of URI reference resolution functionality has exceeded the scope of
this 'urischemes' module that purports to be for 'extensible URI parsing'.  It
is becoming a scheme-aware and general-purpose syntactic processing library
for URIs, and should be characterized as such in its name as well as in its
documentation. 

Even without a new name and more accurately documented scope, people are going
to see no reason not to add the rest of STD 66's functionality to it
(percent-encoding, normalization for testing equivalence, syntax
validation...). As you can see in Ft.Lib.Uri, the latter two are not at all
hard to implement, especially if you use regular expressions. These all fall 
under syntactic operations on URIs, just like reference-resolution.

Percent-encoding gets very hairy with its API details due to application-level
uses that don't jibe with STD 66 (e.g. the fuzzy specs and convoluted history
governing application/x-www-form-urlencoded), the nuances of character
encoding and Python string types, and widely varying expectations of users.



Re: [Python-Dev] New Pythondoc by effbot

2006-01-24 Thread Mike Brown
BJ Why does it have to be wiki-like? Why can't it be a wiki? MediaWiki
 seem to work pretty well for a lot of software projects that have put
 their documentation in a wiki. Talk pages for commentary and primary
 pages for reviewed content.

And inconsistent formatting from article to article, limitations in indexing
options, no way to browse a set of documentation specific to a particular
Python release...

I personally like the PHP docs - http://www.php.net/manual/en/

They're not versioned, and users can't modify the indexes or API docs, but
they do get to add annotations. Annotations that reflect errors or major 
omissions from the docs can be reviewed by editors and folded into the docs
as needed. *shrug*


[Python-Dev] ElementTree in stdlib

2005-12-12 Thread Mike Brown
Catching up on some python-dev email, I was surprised to see that things seem 
to be barrelling ahead with the adding of ElementTree to Python core without 
any discussion on XML-SIG. Sidestepping XML-SIG and the proving grounds of 
PyXML in order to satisfy the demand for a Pythonic databinding+API for XML in 
stdlib seems to be a bit of a raised middle finger to those folks who have 
worked hard on competing or differently-scoped APIs, each of which deserves a 
bit more peer review than just a single nomination on python-dev, which seems 
to be all it took to obtain a blessing for ElementTree. I have nothing against 
ElementTree, and would like to see more XML processing options in core, but it 
seems to me like the XML-SIG is being deliberately left out of this process.

Just last month, Guido submitted to XML-SIG a Pythonic XML API that he had 
been tinkering with.[1] I don't think anyone was really bold enough to tell 
him what they really thought of it (other than that it is a lot like XIST), 
but it was admirable that he put it up for peer review rather than just 
dropping it into stdlib. Perhaps more importantly, it prompted some discussion 
that more or less acknowledged that these kinds of APIs do seem to be the 
future of XML in Python, and that we should be thinking about bringing some of 
them into PyXML and, ultimately, stdlib. But the problem of how to choose from 
the many options also became immediately apparent.[2] The discussion stalled, 
but I think it should start up again, in the proper forum, rather than letting 
the first-mentioned API supplant the dozen+ alternatives that could also be 
considered as candidates.[3]

Sorry to be a sourpuss.

Mike
-- 

[1] http://mail.python.org/pipermail/xml-sig/2005-November/011248.html
 (Guido's very civil proposal and request for peer review)
[2] http://mail.python.org/pipermail/xml-sig/2005-November/011252.html (this
 also summarizes the categories of software/approaches that people are
 taking to the general problem of working with XML Pythonically)
[3] http://www.xml.com/pub/a/2004/10/13/py-xml.html (and there are at least
 3 more databinding APIs that have come out since then)


Re: [Python-Dev] ElementTree in stdlib

2005-12-12 Thread Mike Brown
Martin v. Löwis wrote:
 So as that has more-or-less failed, the next natural approach is
 "let's believe in the community". For that, two things need to
 happen: the author of the package must indicate that he would like
 to see it incorporated, and the users must indicate that they like
 the package. Both has happened for ElementTree, but I think it
 could happen for other packages, as well.
 
 If it is merely the lack of due process you are complaining about,
 and you agree with the result, then IMO nothing would need to be
 changed about the result. Discussing it post-factum on xml-sig
 might still be valuable.

Thanks Martin and others for responding.

I fully agree that ElementTree has proven to be useful, popular, and stable, 
and probably no one would object to ElementTree being given the endorsement 
that is implicit in its being made a part of stdlib. 

The lack of due process, given that XML-SIG seems to exist largely to provide 
that very service for all things XML in Python, is indeed all I'm complaining 
about. I am happy that for once, there is momentum behind this sort of thing,
and more power to you for that.

My fears are just that 1. XML-SIG is being seen as either irrelevant or as an 
obstacle (perhaps due to the friction between Fredrik and Uche) and is thus 
being sidestepped, and 2. other libs that could/should be contenders (Amara 
and 4Suite are not in this list, by the way) are going to become further 
marginalized by virtue of the fact that people will say "well, we have 
ElementTree in stdlib already, why do we need (fill in the blank)?"

I suppose the same kind of implicit endorsements were given to minidom and 
SAX, and that obviously hasn't prevented people from going out and using 
ElementTree, lxml, etc., so I don't know... I can't predict the future. I'd 
just feel better about it if everyone on XML-SIG, where people hang out 
because they have a definite interest in this kind of thing, knew what was 
going on. Some authors of other libs may not even be aware that they could so 
easily have their code whisked into stdlib, if it's solid enough.

Mike


Re: [Python-Dev] urlparse brokenness

2005-11-23 Thread Mike Brown
Paul Jimenez wrote:
 So I propose that urlsplit, the main offender, be replaced with something
 that looks like:
 
 def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):

+1 in principle.

You should probably do a
global _parse_cache

and add 'is not None' after 'if cached'.


Re: [Python-Dev] bug in urlparse

2005-09-08 Thread Mike Brown
[EMAIL PROTECTED] wrote:
 According to RFC 2396[1] section 5.2:

RFC 2396 is obsolete. It was superseded by RFC 3986 / STD 66 early this year.

In particular, the procedure for removing dot-segments from the path component 
of a URI reference -- a procedure that is only supposed to be done when 
'resolving' a reference to absolute form (i.e., merging it with a base URI, 
which, being a URI, not a URI reference, is not allowed to contain 
dot-segments) -- has received a significant overhaul.

The implementation guidance you quoted from RFC 2396 is no longer relevant. 
Technically, it never was relevant, since urlparse only claims to implement 
RFC 1808 (2396's predecessor, now ten years old).

The new procedure says

  ...dot-segments are intended for use in URI references to
   express an identifier relative to the hierarchy of names in the base
   URI.  The remove_dot_segments algorithm respects that hierarchy by
   removing extra dot-segments rather than treat them as an error or
   leaving them to be misinterpreted by dereference implementations.
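
For reference, the remove_dot_segments procedure from RFC 3986 section 5.2.4 can be sketched in Python roughly as follows. This is an illustrative transcription of the steps in the RFC, not code from urlparse, and it is meant to be applied only to the merged path during reference resolution.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4, applied to a path string."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # "/../" becomes "/"
            if output:
                output.pop()             # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to
            # the output, up to but not including the next "/".
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

Note how a leading "/../" with nothing left to pop is simply discarded, which is the "removing extra dot-segments" behavior the quoted passage describes.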

-Mike


Re: [Python-Dev] string find(substring) vs. substring in string

2005-02-16 Thread Mike Brown
Fredrik Lundh wrote:
 any special reason why "in" is faster if the substring is found, but
 a lot slower if it's not in there?

Just guessing here, but in general I would think that it would stop searching 
as soon as it found it, whereas until then, it keeps looking, which takes more 
time. But I would also hope that it would be smart enough to know that it 
doesn't need to look past the 2nd character in 'not the xyz' when it is 
searching for 'not there' (due to the lengths of the sequences).
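
The guess above can be illustrated with a deliberately naive search. This is not how CPython implements `in` or `find`; `naive_find` is a hypothetical helper that only shows the two effects mentioned: stopping at the first match, and never trying a start offset where the substring could no longer fit.

```python
def naive_find(haystack, needle):
    """Return (index, offsets_tried) for a naive left-to-right search."""
    tried = 0
    last_start = len(haystack) - len(needle)  # last offset where needle fits
    for start in range(last_start + 1):
        tried += 1
        if haystack[start:start + len(needle)] == needle:
            return start, tried   # found: stop immediately
    return -1, tried              # not found: every offset was tried


miss = naive_find("not the xyz", "not there")
hit = naive_find("not there at all", "not there")
```

For the example in the text, the lengths are 11 and 9, so only start offsets 0 through 2 are candidates before the search can give up.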


Re: [Python-Dev] mimetypes and _winreg

2005-02-01 Thread Mike Brown
Following up on this 12 Jun 2004 post...

Garth wrote:
 Thomas Heller wrote:
 Mike Brown [EMAIL PROTECTED] writes:
 I thought it would be nice to try to improve the mimetypes module by having 
 it, on Windows, query the Registry to get the mapping of filename 
 extensions 
 to media types, since the mimetypes code currently just blindly checks 
 posix-specific paths for httpd-style mapping files. However, it seems that 
 the 
 way to get mappings from the Windows registry is excessively slow in Python.
 
 I'm told that the reason has to do with the limited subset of APIs that are 
 exposed in the _winreg module. I think it is that EnumKey(key, index) is 
 querying for the entire list of subkeys for the given key every time you 
 call 
 it. Or something. Whatever the situation is, the code I tried below is way 
 slower than I think it ought to be.
 
 Does anyone have any suggestions (besides write it in C)? Could _winreg 
 possibly be improved to provide an iterator or better interface to get the 
 subkeys? (or certain ones? There are a lot of keys under HKEY_CLASSES_ROOT, 
 and I only need the ones that start with a period).
 
 See this post I made some time ago:
 http://mail.python.org/pipermail/python-dev/2004-January/042198.html
 
 Should I file this as a feature request?
 
 If you still think it should be changed in the core, you should work on
 a patch.
 
 I could file a patch if no one else is looking at it. The solution would 
 be to use RegEnumKeyEx and remove RegQueryInfoKey. This loses
 compatibility with win16 which I guess is ok.
 
 Garth

I would say it looks like no one else was looking at it, and Garth apparently 
didn't submit a patch. It's beyond my means to come up with a patch myself. 
Would someone be willing to take a look at it?
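
For anyone picking this up, the enumeration pattern in question looks roughly like the sketch below. It is written for Python 3, where the module is named `winreg`; `iter_subkey_names` and `extension_keys` are hypothetical helpers, and the `enum_key` parameter exists only so the loop can be exercised without a Windows registry. It shows the one-call-per-index pattern discussed in the thread and does not itself make anything faster.

```python
try:
    import winreg  # Windows only; this module was _winreg in Python 2
except ImportError:
    winreg = None


def iter_subkey_names(key, enum_key=None):
    """Yield subkey names of an open registry key, one EnumKey call each."""
    if enum_key is None:
        if winreg is None:
            raise RuntimeError("winreg is only available on Windows")
        enum_key = winreg.EnumKey
    index = 0
    while True:
        try:
            name = enum_key(key, index)
        except OSError:
            return  # EnumKey signals the end of the list with OSError
        yield name
        index += 1


def extension_keys(key, enum_key=None):
    """Return only the subkey names that start with a period."""
    return [n for n in iter_subkey_names(key, enum_key) if n.startswith(".")]
```

On Windows this would be called as `extension_keys(winreg.HKEY_CLASSES_ROOT)`; every subkey is still visited, so any real speedup would have to come from the C side of the module.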

Sorry, but I really want access to registry subkeys to stop being so dog-slow. 
:)

Thanks for taking a look,

-Mike