On Mon, Aug 3, 2009 at 11:28 PM, Graham Dumpleton <graham.dumple...@gmail.com> wrote:
>> Mainly I'm wondering, what should the server do in the event they receive a
>> byte string which is not valid UTF-8? (Latin-1 doesn't have this problem,
>> since there's no such thing as an invalid Latin-1 string, at least not at
>> the encoding level.)
>
> Can you clarify. We aren't talking about request content here. The
> wsgi.input stream is still binary and up to WSGI application to decode
> how it decides it should be decoded.

You could receive something like:

    GET /fran%E7ais

If you then do:

    urllib.unquote('/fran%E7ais').decode('utf8')

you will get an error. So what should the server do? Obviously anyone
at any time can embed <a href="/fran%E7ais"> in a document, and the
browser is not going to try to figure out that encoding; it's just
going to follow that URL. From my testing (in
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py) the browser
will be consistent about UTF8 when it does the encoding itself, but it
doesn't necessarily do the encoding itself. QUERY_STRING will *not*
necessarily be UTF8, even when the path is UTF8 (but this doesn't
matter for us, because QUERY_STRING doesn't get url-decoded, so it's
just ASCII with %-encoding).

> The only related thing I can think you are talking about is the form
> target URL, which is an issue for GET and POST requests, or other
> method types, from a form.
>
>>> Also shown though that SCRIPT_NAME part has to be UTF-8
>>> and we would really be entering fantasy land if you were somehow going
>>> to cope with some different encoding for PATH_INFO and QUERY_STRING.
>>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>>> particular area means you are effectively bound to use UTF-8
>>> everywhere else.
>>
>> I'm not clear on your logic here. If I request foo/bar/baz (where baz
>> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
>> script, then the (accented) baz is legitimate for pass-through to the
>> application, no?
>
> Technically, but what I am pointing out is that Apache pretty well
> says that foo/bar needs to be UTF-8. If you are going to have
> different parts of the one URL needing a different encoding to be
> understood, personally I would say you asking for trouble. So, am
> saying that UTF-8 needs to really apply more for sake of sanity and
> portability.

Apache's limitations can't be encoded into WSGI.
Yes, it won't work with Apache (I guess, though with ProxyPass / or
something, is this a problem?) -- but the idea of mapping request paths
to files has nothing to do with WSGI.

>> I just tried testing this with Firefox and Apache, and found that you can in
>> fact pass such Latin-1 strings through to PATH_INFO, but at least in the
>> case of Firefox, you have to %-escape them. However, they are seen by
>> Python (via os.environ) as latin-1 encoded byte strings.
>
> By using % escapes you are in practice overriding the encoding that
> the browser may be applying to URL if given raw character? What
> happens if you were to paste the accented character direct into the
> browser URL bar? Browsers I have played with would normally
> automatically translate that as UTF-8 and send it as such, with %
> encoding as necessary.

Correct; the browser encodes non-ASCII characters as UTF8, but does not
try to inspect the encoding of already %-encoded characters.

> So I guess the problem is more where URLs are already % encoded when
> coming back as href or form action because they may be in an encoding
> incompatible with UTF-8 if it were to be clicked on.
>
>>> Further example of why UTF-8 reaches into everything is mod_rewrite
>>> module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>>> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache
>>> configuration file has to be UTF-8. If URL isn't, then wouldn't be
>>> possible to perform matches against non latin-1 characters in a
>>> rewrite condition or rule. This is because your match string would be
>>> in different encoded form to that in URL and so wouldn't match.
>>
>> Note that this still doesn't have any impact on the bytes that actually
>> reach the application, which can be non-UTF8. At minimum, the proposal is
>> underspecified as to how to handle this case, which is as trivial to
>> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s)
>> of a URL.
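
To make the %-escape case concrete, here is a minimal sketch. It is
written for Python 3, whereas the urllib.unquote call quoted in this
thread is Python 2. The helper name decode_path is hypothetical, and the
Latin-1 fallback is only one possible server policy, assumed here for
illustration; nothing in this thread or in WSGI prescribes it.

```python
from urllib.parse import quote, unquote_to_bytes


def decode_path(raw_path):
    """Percent-decode raw_path and return (text, encoding_used).

    Hypothetical helper: tries UTF-8 first, then falls back to Latin-1.
    The fallback is an assumed policy, not something WSGI specifies.
    """
    raw = unquote_to_bytes(raw_path)
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


# A pre-escaped Latin-1 URL, as it might appear in an href attribute.
print(decode_path("/fran%E7ais"))

# What a client produces when it does the UTF-8 encoding itself.
print(quote("/fran\u00e7ais"))
print(decode_path(quote("/fran\u00e7ais")))
```

The first call takes the fallback branch because the byte 0xE7 followed
by "a" is not a valid UTF-8 sequence; the last call decodes cleanly as
UTF-8. A server that rejected such requests instead of falling back
would be an equally defensible choice under this sketch.
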
>
> The Apache server at least will decode those % escape sequence and I
> believe it is the result of that which is used in stuff like rewrite
> rule matches, not the raw URL. The only exception would be if rewrite
> rule explicit matched against REQUEST_URI variable which still
> contains % escape sequences. So if not in UTF-8, means effectively
> that you can't then match them with Apache rewrite rules then.

_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com