And Clover wrote:
> Manlio Perillo wrote:
>
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URI (if I remember correctly) the suggested encoding is UTF-8, so
>> URLs should be decoded using
>
>> url.decode('utf-8', 'surrogateescape')
>
>> Is this correct?
>
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes
> to be trivially extracted. This is consistent with the other headers and
> would be my preferred approach.
>
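For reference, the two decodings mentioned in the quoted text can be compared with a minimal sketch. The sample bytes in `raw_path` are made up for illustration and are not taken from any real request; the only claim checked here is that both decodings are lossless, which is what "trivially extracted" relies on.

```python
# Hypothetical PATH_INFO bytes: b"\xc3\xa9" is UTF-8 for "é",
# while the trailing b"\xff" is not valid UTF-8 at all.
raw_path = b"/caf\xc3\xa9/\xff"

# ISO-8859-1 maps each byte 0x00-0xFF to the code point with the same
# value, so decoding never fails and re-encoding restores the bytes.
as_latin1 = raw_path.decode("iso-8859-1")
assert as_latin1.encode("iso-8859-1") == raw_path

# UTF-8 with the surrogateescape handler also round-trips: bytes that
# cannot be decoded are smuggled through as lone surrogates.
as_utf8 = raw_path.decode("utf-8", "surrogateescape")
assert as_utf8.encode("utf-8", "surrogateescape") == raw_path

# An application handed the ISO-8859-1 form can get back to the
# UTF-8 interpretation by going through the original bytes.
recovered = as_latin1.encode("iso-8859-1").decode("utf-8", "surrogateescape")
assert recovered == as_utf8
```

The two strings differ in content (the ISO-8859-1 form has one character per byte, the UTF-8 form shows "café"), but either one lets the application recover the exact bytes.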
There is something that I don't understand.

Some HTTP headers, like Accept-Language, contain data described as
`token`, where:

    token = 1*<any CHAR except CTLs or separators>

So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
In Python 3.x it SHOULD be a byte string.

Text content is described as `TEXT`, where:

    The TEXT rule is only used for descriptive field contents and
    values that are not intended to be interpreted by the message
    parser. Words of *TEXT MAY contain characters from character sets
    other than ISO-8859-1 [22] only when encoded according to the
    rules of RFC 2047 [14].

    TEXT = <any OCTET except CTLs, but including LWS>

The only type of data where TEXT can be used is `quoted-string`.
A `quoted-string` only appears in well-specified portions of a header.

So, IMHO, it is *not* correct for a WSGI middleware to return all HTTP
headers as Unicode strings.

This is up to the application/framework, which must parse each header,
split it into components and handle each one as appropriate (as a byte
string, a Unicode string or an instance of some other data type).

> [...]

Regards  Manlio
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
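P.S. To make the point about tokens concrete, here is a minimal sketch of an application-level parser that keeps Accept-Language language ranges as byte strings and only converts the q-value to a number. The function name `parse_accept_language` is hypothetical and not part of WSGI or of any library, and the parsing is deliberately simplified (no validation of the token grammar).

```python
from typing import List, Tuple


def parse_accept_language(value: bytes) -> List[Tuple[bytes, float]]:
    """Split a raw Accept-Language value into (language-range, q) pairs.

    Hypothetical helper: language ranges stay as opaque byte strings;
    only the q parameter is interpreted, as a float.
    """
    result = []
    for element in value.split(b","):
        element = element.strip()
        if not element:
            continue
        lang, _, params = element.partition(b";")
        q = 1.0  # default quality when no q parameter is given
        for param in params.split(b";"):
            name, _, val = param.strip().partition(b"=")
            if name.strip() == b"q":
                try:
                    q = float(val.strip().decode("ascii"))
                except ValueError:
                    # Covers malformed numbers and non-ASCII bytes;
                    # treating them as q=0 is an assumption of this sketch.
                    q = 0.0
        result.append((lang.strip(), q))
    # Most preferred first; the sort is stable, so ties keep header order.
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result


preferences = parse_accept_language(b"da, en-gb;q=0.8, en;q=0.7")
```

Nothing here needs a Unicode string: the language ranges are compared and passed around as bytes, and a framework would decode only the parts of a header that are actually `quoted-string` text.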