Re: [Web-SIG] Future of WSGI
On Tue, Nov 24, 2009 at 10:50:00PM +0100, Malthe Borch wrote:
> How people use or abuse software is not our concern; but the standard
> library should not itself abuse its own abstractions.

Your assumption is that `environ` == HTTP headers. That's simply NOT
the case. A request is:

- A request line
- Some headers
- A body

(See http://tools.ietf.org/html/rfc2616#section-5)

The request body, the request method (GET, POST, ...), the request URL,
the HTTP version are all in `environ`.

If you really want to separate the headers from the rest you would put
another dictionary containing the headers inside `environ`. Instead
WSGI puts the headers prefixed with HTTP_ in `environ`, because that's
what CGI does. It might not be 100% clean or logical, but it's SIMPLER:
there's no need to deal with nested dictionaries or other more complex
structures, and it's extensible.

> Request = namedtuple('Request', 'environ body')
> Response = namedtuple('Response', 'status headers iterable')
>
> Iterable might be body or chunks or some other term.

namedtuple is Python 2.6+: WSGI can't use it. WSGI must work w/ older
versions of Python.

--
Henry Prêcheur
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
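To make the HTTP_ convention concrete, here is a minimal sketch of how an application could pull the request headers back out of a flat `environ`. The helper name `headers_from_environ` and the sample dictionary are made up for this illustration; they are not part of WSGI or of any framework.

```python
def headers_from_environ(environ):
    """Collect request headers from a flat WSGI/CGI-style environ dict."""
    headers = {}
    for key, value in environ.items():
        if key.startswith('HTTP_'):
            # HTTP_ACCEPT_LANGUAGE -> Accept-Language
            name = key[5:].replace('_', '-').title()
            headers[name] = value
        elif key in ('CONTENT_TYPE', 'CONTENT_LENGTH'):
            # CGI passes these two headers without the HTTP_ prefix.
            headers[key.replace('_', '-').title()] = value
    return headers


# Hypothetical environ: request line data and headers live side by side.
example_environ = {
    'REQUEST_METHOD': 'GET',
    'PATH_INFO': '/index',
    'SERVER_PROTOCOL': 'HTTP/1.1',
    'HTTP_HOST': 'example.org',
    'HTTP_ACCEPT_LANGUAGE': 'fr',
}
example_headers = headers_from_environ(example_environ)
```

The method, path and protocol stay ordinary keys in the same dictionary, so no nested structure is needed to tell them apart from headers.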
Re: [Web-SIG] [RFC] urllib2 requests history + HEAD support
On Sun, Dec 20, 2009 at 11:38:19PM +0530, Senthil Kumaran wrote:
> I need your opinion on this request.
> http://bugs.python.org/issue1673007
>
> Python Standard Library module urllib2 has support GET and POST.
> There was a feature request to add support for HEAD requests.

It would be nice to have other methods too, like PUT and DELETE:
http://tools.ietf.org/html/rfc2616#page-52

> While that is valid feature request, there was suggestion to include
> a history of the requests in the module. I don't find any references
> in the RFCS for any such requirement to maintain a history of
> requests. Do you have any opinion on whether is it a good idea to
> have history of requests in the urllib2 module? I personally feel
> that history of requests can be easier tracked by the clients.

This should be done by the client.

--
Henry Prêcheur
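As an illustration of both points, here is a small sketch using Python 3's `urllib.request` (the successor of urllib2). It assumes Python 3.3 or later, where `Request` accepts a `method` argument; the `HeadRequest` subclass shows the older workaround of overriding `get_method()`. The `history` list is a made-up example of client-side tracking, and no request is actually sent.

```python
import urllib.request

# Python 3.3+: the HTTP method can be given directly.
head_request = urllib.request.Request('http://example.org/', method='HEAD')


class HeadRequest(urllib.request.Request):
    """Older workaround: force the method by overriding get_method()."""

    def get_method(self):
        return 'HEAD'


legacy_request = HeadRequest('http://example.org/')

# History kept by the client, not by the library (hypothetical structure).
history = []
for request in (head_request, legacy_request):
    history.append((request.get_method(), request.full_url))
```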
Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
On Thu, Dec 03, 2009 at 07:35:14PM +0100, And Clover wrote:
> I don't know what the HTTP/Cookie spec says about this.
>
> The traditional interpretation of RFC2616 is that headers are
> ISO-8859-1. You will notice that no browser correctly follows this.

RFCs 2109 and 2965 say that a cookie's value can be anything:

    The VALUE is opaque to the user agent and may be anything the
    origin server chooses to send, possibly in a server-selected
    printable ASCII encoding.

Theoretically you could put something like 'foo\n\0bar' in a cookie.

Also, a cookie can include comments, which have to be encoded using ...
UTF-8:

    Comment=value
      OPTIONAL. Because cookies can be used to derive or store private
      information about a user, the value of the Comment attribute
      allows an origin server to document how it intends to use the
      cookie. The user can inspect the information to decide whether
      to initiate or continue a session with this cookie. Characters
      in value MUST be in UTF-8 encoding.

--
Henry Prêcheur
Re: [Web-SIG] HTTP headers encoding
On Thu, Dec 03, 2009 at 05:09:31PM +0100, Manlio Perillo wrote:
> This is really a mess.

RFC 2617 doesn't specify any encoding for its headers, so it should be
latin-1 everywhere. But on the web nobody respects standards.

> How is authorization username handled in common WSGI frameworks?

As far as I know, they don't handle this. They just return the string
without dealing with the encoding issues.

I think there is no correct way of handling this, and 99% of usernames
and passwords contain only ASCII characters anyway. A possible
'workaround' would be to limit yourself to the ASCII charset: if you
get a non-ASCII character, raise an exception.

--
Henry Prêcheur
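The ASCII-only workaround could be sketched as follows. This is only an illustration under stated assumptions: the function name `parse_basic_auth` is invented here, it handles the Basic scheme only, and it is not taken from any framework.

```python
import base64


def parse_basic_auth(header_value):
    """Return (username, password) from a Basic Authorization header value.

    Raises ValueError if the header is malformed or the credentials
    contain non-ASCII characters.
    """
    scheme, _, encoded = header_value.partition(' ')
    if scheme.lower() != 'basic':
        raise ValueError('not a Basic Authorization header')
    raw = base64.b64decode(encoded)
    try:
        # Strict decoding: refuse anything outside ASCII instead of guessing.
        credentials = raw.decode('ascii')
    except UnicodeDecodeError:
        raise ValueError('non-ASCII characters in credentials')
    username, separator, password = credentials.partition(':')
    if not separator:
        raise ValueError('missing ":" between username and password')
    return username, password


ascii_header = 'Basic ' + base64.b64encode(b'alice:secret').decode('ascii')
latin1_header = 'Basic ' + base64.b64encode(b'fran\xe7ois:secret').decode('ascii')
```

Raising here keeps the failure visible; decoding with 'replace' would silently turn two different usernames into the same string.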
Re: [Web-SIG] HTTP headers encoding
On Thu, Dec 03, 2009 at 08:33:19PM +0100, Manlio Perillo wrote:
> Right now I'm doing a:
>
>     username.decode('us-ascii', 'replace')

Or like most frameworks you could let the application author deal with
the problem: just pass the raw strings to the application.

--
Henry Prêcheur
Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote:
> There is something that I don't understand.
>
> Some HTTP headers, like Accept-Language, contains data described as
> `token`, where:
>
>     token = 1*<any CHAR except CTLs or separators>
>
> So a token, IMHO, is an opaque string, and it SHOULD not decoded.
> In Python 3.x it SHOULD be a byte string.

I think this is more an issue that frameworks should deal with.
Decoding every header value as latin-1 has several advantages:

* It keeps WSGI simple. Simple is good.

* WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol --
  HTTP/1.1) says. WSGI is about HTTP, but that doesn't necessarily
  include all other standards extending HTTP.

* It's possible to convert latin-1 strings to bytes without losing
  data.

--
Henry Prêcheur
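The last point can be checked directly in Python 3. The sketch below is only an illustration (the variable names are arbitrary): every byte value survives a latin-1 round trip, so an application that knows the real encoding can still recover the intended text.

```python
# All 256 byte values decode to exactly one character each, and back.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')

# A header that was really UTF-8 on the wire, handed over as latin-1.
raw_header = 'français'.encode('utf-8')
environ_value = raw_header.decode('latin-1')   # looks garbled, loses nothing
fixed_value = environ_value.encode('latin-1').decode('utf-8')
```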
Re: [Web-SIG] Future of WSGI
On Tue, Nov 24, 2009 at 11:36:57PM +0100, Malthe Borch wrote:
> 2009/11/24 Henry Precheur he...@precheur.org:
>> (See http://tools.ietf.org/html/rfc2616#section-5)
>>
>> The request body, the request method (GET, POST, ...), the request
>> URL, the HTTP version are all in `environ`.
>
> That reference does not mention the environment. It's not an
> official term.

Are you talking about PEP-333 or RFC 2616?

>> namedtuple is Python 2.6+: WSGI can't use it. WSGI must work w/
>> older versions of Python.
>
> It was meant as illustration, but sure.

Then what? Your proposal doesn't work. So let's forget about it and
stick to dict?

--
Henry Prêcheur
Re: [Web-SIG] Future of WSGI
On Tue, Nov 24, 2009 at 11:16:05PM +0100, Sylvain Hellegouarch wrote:
> Though it shouldn't be considered as a problem, the fact that
> probably no existing framework actually use the raw dictionary
> (there is, in almost all cases, a wrapping into a friendlier
> object), one might wonder why keeping such a low level interface
> rather than directly provide a higher level interface is a good
> idea. After all creating those dictionaries for no good reason aside
> from sending them to the next layer which will map them into a
> WebOb, a yaro, a cherrypy request, or zope request, etc. seems
> slightly pointless

1. Would you say that POSIX is useless because there are lots of
   libraries and applications built on top of it? Why not implement
   those libraries and applications directly without using POSIX?

2. Guess what: WebOb, Werkzeug, Yaro, Django, CherryPy, and the others
   have different interfaces for their Request/Response objects,
   because for Request/Response there's hardly a one-size-fits-all.
   There's certainly some common ground, but every framework has
   different needs.

> (I'm not versed into Python internals, but doesn't it have also a
> cost of creating rather useless objects repeatedly like that?)

The dictionary is passed as a reference, like every Python object. So
it doesn't cost anything to use it instead of an object.

--
Henry Prêcheur
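A tiny sketch of the 'passed as a reference' point: a framework-style wrapper keeps a reference to the same dictionary rather than copying it. The `Request` class below is hypothetical and far simpler than what WebOb or Werkzeug provide.

```python
class Request:
    """Hypothetical minimal wrapper around a WSGI environ."""

    def __init__(self, environ):
        # Stores a reference to the caller's dict; nothing is copied.
        self.environ = environ

    @property
    def method(self):
        return self.environ['REQUEST_METHOD']


environ = {'REQUEST_METHOD': 'GET', 'PATH_INFO': '/'}
request = Request(environ)
same_object = request.environ is environ

# A change made through the wrapper is visible in the original dict.
request.environ['PATH_INFO'] = '/changed'
```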
Re: [Web-SIG] Web Framework
On Sun, May 31, 2009 at 09:30:26AM -0700, Omar Munk wrote:
> - A good documentation.
> - Not to overkill like Django
> - Easy and simple
> - Just something like PHP but without the dirty style.
> - I like Karrigell but it looks like it's dead do you know a clone
>   of it?
> - Not need a VPS to host it, just a server that has Python.

I would still recommend Django. I think it's the best web framework if
you are beginning. It's not like PHP, but I don't know of anything like
PHP in Python.

> And creating your own is that hard?

Yes, it's hard, especially if you are new to web development.

Cheers,

--
Henry Prêcheur
Re: [Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO
On Tue, Sep 22, 2009 at 11:26:15PM -0400, P.J. Eby wrote:
> +1, if you mean the strings have the same content,
> character-for-character on Python 2.3. That is, a \x80 byte in a
> Python 2 'str' is matched by an \x80 character in the Python 3
> 'str'. (I presume that's what we mean by native, but I want to be
> sure.)

It is the case (Python 3 code):

    >>> ord(b'\x80'.decode('latin1')) == b'\x80'[0]
    True

Also I'd like to point out that the Cookie problem could be more
general than we think. HTTP_COOKIE is the only header we have
identified so far with a weird encoding scheme. But I am pretty sure
some idiots have created, or will create, other weird headers with
strange encoding schemes (let's mix UTF-8 and latin-1 just for the fun
of it). Defaulting to latin-1 ensures that WSGI is solid enough to
face these weird situations.

I strongly back the use of a single encoding. The proposed
wsgi.uri_encoding method doesn't seem to add anything compared to
latin-1.

Ian's proposal seems to be fairly complete and addresses all the issues
we had, with the exception of the outstanding issues he pointed out at
the end of his mail.

--
Henry Prêcheur
Re: [Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO
On Tue, Sep 22, 2009 at 09:22:48PM -0500, Ian Bicking wrote:
> Well, the biggie: is it right to use native strings for the environ
> values, and response status/headers? Specifically, tricks like the
> latin1 transcoding won't work in Python 2, but will in Python 3.
> Is this weird? Or just something you have to think about when using
> the two Python versions?

I don't have the whole discussion in mind. But except for 'using
unicode everywhere', I don't think there's a single proposal that would
allow people to keep the same 'logic' in both Python 2 and 3.

Using bytes in Python 3 requires you to have two different kinds of
'logic' for Python 2 and 3, because of the limitations of bytes, which
can't do everything str can do, and the stdlib's problems with bytes.

Using str in Python 3 requires you to have two different kinds of
'logic' too, because Python 3's str is not Python 2's str.

(Just to make things clear, the term 'logic' refers to transcoding of
strings into the correct encoding.)

> What happens if you give unicode text in the response headers that
> cannot be encoded as Latin1?

We could ignore the header. But a response header that contains
non-Latin-1 characters is not WSGI compliant, so I would expect an
error. To cite The Zen of Python: Errors should never pass silently.

> Should some things be unicode on Python 2?

No. I think it's more important to keep WSGI simple. Let's use str
everywhere. Frameworks can always transcode what should be Unicode;
that's their job.

> Is there a common case here that would be inefficient?

Transcoding every string from Latin-1 to Unicode could be time
consuming. The only way I see to make things faster is to use bytes
everywhere, but that's not possible given the previous discussions.

--
Henry Prêcheur
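The 'expect an error' behaviour could look like the following sketch. It assumes Python 3 native strings for header names and values; the function name `check_response_headers` is invented for this example and is not part of any WSGI server.

```python
def check_response_headers(headers):
    """Raise ValueError if any header name or value is not Latin-1 encodable."""
    for name, value in headers:
        for text in (name, value):
            try:
                text.encode('latin-1')
            except UnicodeEncodeError:
                # Fail loudly instead of silently dropping the header.
                raise ValueError(
                    'header %r contains non-Latin-1 text: %r' % (name, text))
    return headers


valid_headers = [('Content-Type', 'text/plain'), ('X-Name', 'fran\xe7ois')]
invalid_headers = [('X-Price', '\u20ac10')]  # the euro sign is not in Latin-1
```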
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 11:09:24AM -0500, Ian Bicking wrote:
> I think surrogateescape can resolve the small handful of problems.

+1

surrogateescape would be a great alternative to the
try-utf-8-then-latin-1 approach. It would simplify the gateway and the
application. No need to check some 'encoding' variable and transcode
later. We just encode everything to UTF-8, no special case.

surrogateescape isn't implemented (yet?) for Python 2. That's not an
issue if the 'new' WSGI sticks to native strings.

--
Henry Prêcheur
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote:
> So the same standard should have different behavior on different
> Python versions? That would make framework code a lot more
> complicated.

I don't understand why it would be 'a lot more' complicated.

(The following code snippets are Python 3 only, and assume we're using
'native strings' everywhere.)

In the gateway, environ would be populated this way:

    environ['some_key'] = some_value.decode('utf8', 'surrogateescape')

Compare that to the utf-8-then-latin-1 alternative:

    try:
        environ['some_key'] = some_value.decode('utf-8')
        environ['some_key.encoding'] = 'utf-8'
    except UnicodeError:
        environ['some_key'] = some_value.decode('latin-1')
        environ['some_key.encoding'] = 'latin-1'

What you would have in the application to get the original value:

    environ['some_key'].encode('utf8', 'surrogateescape')

With utf8-then-latin1:

    environ['some_key'].encode(environ['some_key.encoding'])

The 'surrogateescape' way is clearly simpler.

The 'equivalent' Python 2 code is even simpler:

    environ['some_key'] = some_value

And:

    environ['some_key']

--
Henry Prêcheur
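For anyone who wants to try the two approaches side by side, here is a runnable sketch (Python 3). The function names and the sample byte strings are made up for the comparison; both approaches recover the original bytes, but the fallback one has to carry the encoding name along.

```python
def decode_surrogateescape(raw):
    """Gateway side, single rule: UTF-8 with undecodable bytes escaped."""
    return raw.decode('utf-8', 'surrogateescape')


def decode_with_fallback(raw):
    """Gateway side, two rules: try UTF-8, fall back to latin-1."""
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeError:
        return raw.decode('latin-1'), 'latin-1'


# One valid UTF-8 path and one latin-1 path (invalid as UTF-8).
samples = [b'/caf\xc3\xa9', b'/caf\xe9']

surrogate_ok = []
fallback_ok = []
for raw in samples:
    text = decode_surrogateescape(raw)
    surrogate_ok.append(text.encode('utf-8', 'surrogateescape') == raw)

    text, encoding = decode_with_fallback(raw)
    fallback_ok.append(text.encode(encoding) == raw)
```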
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
> It looks simpler until you have a site that is not primarily utf-8.
> In that case, you multiply your (1 line * number of middlewares in
> the WSGI stack * each request). With wsgi.uri_encoding you get
> either (1 line * 1 middleware designed to transcode * each request),
> or even 0 if your whole site uses just one charset.

I am not sure I understand your point. The 0 lines hold true if the
whole site is using latin-1 or utf-8 and you write your
applications/middlewares only for this site. But if it's using any
other encoding you still have to transcode:

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode('utf8', 'surrogateescape').\
            decode(SITE_ENCODING)
        ...

With wsgi.uri_encoding you would still have to do the following:

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode(environ['some_key.encoding']).\
            decode(SITE_ENCODING)
        ...

Of course you can directly use `environ['some_key']` if you know you'll
get the 'right' encoding all the time. But when the encoding changes,
you'll have to fix all your middlewares.

Am I missing something?

--
Henry Prêcheur
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote:
> The decoding doesn't change spontaneously. You either get the
> correct one or you get an incorrect one. If it's incorrect, you fix
> it, one time, via a WSGI component which you've configured to
> determine the correct decoding. Then every other WSGI component
> below that one can go back to trusting the decoding was correct. In
> fact, if you do that transcoding right away, no other WSGI
> components need to be rewritten to take advantage of unicode. You
> just have to deploy a single transcoder, that's 6 lines of code max.

And you can do that with utf8+surrogateescape too. Except that you
don't have to determine what encoding the gateway sent you: it's always
utf8+surrogateescape.

> With utf8+surrogateescape, you don't transcode once, you transcode
> in every WSGI component in your stack that needs to correct the
> decoding. You have to do it more than once because, each time you
> encode/re-decode, you use the result and then throw it away. Any
> subsequent WSGI components have to encode/re-decode--you cannot
> store the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
> utf8+surrogateescape scheme says...well, it's always utf8-decoded.

You are missing something REALLY important about surrogateescape: you
can ALWAYS get the original bytes back.

    >>> b = b'fran\xe7cois'
    >>> s = b.decode('utf8', 'surrogateescape')
    >>> s
    'fran\udce7cois'
    >>> s.encode('utf8', 'surrogateescape')
    b'fran\xe7cois'

See? I got my latin-1 character '\xe7' back! That works because
'\udce7' is not a normal character: it is a lone surrogate code point,
which never appears in valid UTF-8 text, so it is free to stand in for
the undecodable byte. The only thing you have to do is to pass
'surrogateescape' each time you call encode/decode.

> In addition, *every* component that needs to compare URI's then has
> to be configured with the same logic, however convoluted, to perform
> the correct decoding again. It's not just routing middleware: caches
> need to reliably compare decoded URI's; so do sessions; so does auth
> (especially!); so do static files. And Heaven forfend you actually
> decode differently in two different components!

I don't understand why I would need to throw away the decoded string.
This works perfectly well as far as I know:

    environ['PATH_INFO'] = environ['PATH_INFO'].\
        encode('utf8', 'surrogateescape').\
        decode(SITE_ENCODING)

utf8+surrogateescape provides the same possibilities as
wsgi.uri_encoding. You can transcode without losing information when
you know what the correct encoding is. But utf8+surrogateescape is
simpler because there's no need to pass around the name of the encoding
in an additional variable.

--
Henry Prêcheur
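A sketch of the 'single transcoder' idea with utf8+surrogateescape (Python 3). Everything here is an assumption made for illustration: `SITE_ENCODING`, the helper names, and the choice of keys to transcode. It also assumes the gateway decoded those keys with `decode('utf-8', 'surrogateescape')`.

```python
SITE_ENCODING = 'latin-1'  # assumed site-wide charset, chosen for the example


def transcode(value, site_encoding=SITE_ENCODING):
    """Recover the original bytes, then decode them with the site encoding."""
    return value.encode('utf-8', 'surrogateescape').decode(site_encoding)


def transcoding_middleware(app, keys=('SCRIPT_NAME', 'PATH_INFO')):
    """Wrap `app` so the listed environ keys are re-decoded once, up front."""
    def wrapper(environ, start_response):
        for key in keys:
            if key in environ:
                environ[key] = transcode(environ[key])
        return app(environ, start_response)
    return wrapper


# Simulate a gateway handing over a latin-1 path decoded as utf8+surrogateescape.
gateway_environ = {'PATH_INFO': b'/fran\xe7ois'.decode('utf-8', 'surrogateescape')}
seen = {}


def demo_app(environ, start_response):
    seen['path'] = environ['PATH_INFO']
    return [b'']


transcoding_middleware(demo_app)(gateway_environ, None)
```

Note that such a transcoder has to run exactly once per request: applying `transcode` again to an already corrected string would garble it, so components further down the stack must treat the stored value as final.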
Re: [Web-SIG] WSGI 2: Decoding the Request-URI
On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
> However, we quite often use only a portion of the URI when
> attempting to locate an appropriate handler; sometimes just the
> leading / character! The remaining characters are often passed as
> function arguments to the handler, or stuck in some parameter
> list/dict. In many cases, the charset used to decode these values
> either: is unimportant; follows complex rules from one resource to
> another; or is merely reencoded, since the application really does
> care about bytes and not characters. Falling back to ISO-8859-1 (and
> minting a new WSGI environ entry to declare the charset which was
> used to decode) can handle all of these cases. Server configuration
> options cannot, at least not without their specification becoming
> unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI
here, but all request headers.)

Encoding everything using ISO-8859-1 has the nice property of keeping
information intact. It would be a good heuristic if everything, with a
few exceptions, were encoded using ISO-8859-1. Just transcode the few
problematic cases at the application level and everybody is happy. A
string decoded from ISO-8859-1 is like a bytes object with a string
'interface' on top of it.

But it sweeps the encoding problem under the carpet. The problem with
Python 2 was that str and unicode were almost the same, so much the
same that it was possible to mix them without too many problems:

    >>> 'foo' == u'foo'
    True

Python 3 made bytes and str 'incompatible' to force programmers to
handle the encoding problem as soon as possible:

    >>> b'foo' == 'foo'
    False

By passing `str()` to the application, the application author could
believe that the encoding problem has been handled. But in most cases
it hasn't been handled at all. The application author should still
transcode all the strings that were incorrectly decoded. We are back to
Python 2's bad old days, where we can't be sure that what we got is
properly encoded:

Was that string encoded using latin-1? Maybe a middleware transcoded it
to UTF-8 before the application was called. Maybe the application
itself transcoded it at some point, but then we need to keep track of
what was transcoded. Maybe the application should transcode everything
when it is called.

Also EVERY application author will have to read the PEP, especially the
paragraph saying: Everything we give you are strings, but you still
have to deal with the encoding mess. Otherwise they will have weird
problems like when they were using Python 2, because the interface is
not clear.

Strings are supposed to be text and only text. Encoding everything to
ISO-8859-1 means strings are not text anymore; they are 'encoded data'
[1]. bytes are supposed to be 'encoded data' and binary blobs. By
giving applications bytes, the author knows right away that they should
be decoded. No need to read the PEP.

`bytes` can do everything `str` can do, with the notable exception of
'format'.

    >>> import re
    >>> b'foo bar'.title()
    b'Foo Bar'
    >>> b'/foo/bar/fran\xc3ois'.split(b'/')
    [b'', b'foo', b'bar', b'fran\xc3ois']
    >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
    (b'foo', b'1234')

I understand that `bytes()` is an unfamiliar beast. But I believe the
encoding problem is the realm of the application, not the realm of the
gateway. Let the application handle the encoding problem and don't give
it a half-baked solution.

Using bytes also has its set of problems. The standard library doesn't
support bytes very well. For example urllib.response.unquote() doesn't
work with bytes, and urllib.parse too has issues.

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

--
Henry Prêcheur
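To illustrate why bytes make the problem hard to overlook, here is a short Python 3 sketch (variable names and the sample path are arbitrary). Mixing bytes and str fails immediately, and the application decodes at a point where it knows which encoding applies; UTF-8 is assumed here.

```python
# Mixing bytes and str is an error in Python 3, not a silent coercion.
try:
    b'/foo' + '/bar'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The application receives bytes and decodes explicitly, per segment.
raw_path = b'/foo/bar/fran\xc3\xa7ois'
segments = [segment.decode('utf-8') for segment in raw_path.split(b'/')]
```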
Re: [Web-SIG] WSGI 2
On Wed, Aug 12, 2009 at 12:05:40AM -0500, Ian Bicking wrote:
> Correct -- you can write any set of % encodings, and I don't think
> it even has to be able to validly url-decode (e.g., /foo%zzz will
> work). It definitely doesn't have to be a valid encoding. However,
> if you actually include unicode characters, they will always be
> encoded as UTF-8 (as goes with the IRI standard). This is in a case
> like <a href="/some page">, the browser will request /some%20page,
> because it escapes unsafe characters. Similarly if you request
> <a href="/français"> it will encode that ç in UTF-8, then url-encode
> it, even if the page itself is ISO-8859-1. Well, at least on
> Firefox.
>
> I used this to test:
> http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

I have run some tests regarding the encoding issue:

curl doesn't 'url-encode' its URLs:

    curl 'http://hostname/français'
                              ^ e7 latin-1 character

The latin-1 character is sent to the server. Lighttpd accepts the URL
and even returns a file if it exists. Of course if I try with the same
characters in UTF-8 it doesn't work.

AFAIK RFC 2396 forbids non-ASCII characters in URLs. The problem is
that libcurl is quite popular (it used to be the transport library of
Webkit/GTK+ for example). It's hard to discard it as an utterly broken
obscure tool. Many 'simplistic' HTTP clients may have the same problem.

Now let's talk a little bit about cookies...

Cookies can contain whatever 'binary junk' the server sends. RFC 2965
says (http://tools.ietf.org/html/rfc2965#page-5):

    The VALUE is opaque to the user agent and may be anything the
    origin server chooses to send, possibly in a server-selected
    printable ASCII encoding.

Also, cookies can contain 'comments', which contain UTF-8 strings
(http://tools.ietf.org/html/rfc2965#page-6):

    Characters in value MUST be in UTF-8 encoding.

Firefox has no problem with cookies containing non-ASCII characters. It
looks like it assumes cookies are encoded using latin-1, since latin-1
characters are displayed correctly in Firebug, but not UTF-8 ones.

Cheers,

--
Henry Prêcheur
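The difference between what a browser and a 'simplistic' client put on the wire can be reproduced with Python 3's `urllib.parse`. This is only an illustration; the path is an arbitrary example.

```python
from urllib.parse import quote, unquote_to_bytes

path = '/fran\xe7ais'  # '/français'

# What Firefox was observed to do: encode as UTF-8, then percent-encode.
utf8_quoted = quote(path)                         # default encoding is UTF-8
# The same path percent-encoded from its latin-1 bytes instead.
latin1_quoted = quote(path, encoding='latin-1')

# The server only ever sees bytes; it cannot tell which charset was meant.
utf8_bytes = unquote_to_bytes(utf8_quoted)
latin1_bytes = unquote_to_bytes(latin1_quoted)
```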