Dirkjan Ochtman wrote:

1. The application is passed an instance of a Python dictionary
   containing what is referred to as the WSGI environment. All keys
   in this dictionary are native strings. For CGI variables, all names
   are going to be ISO-8859-1 and so where native strings are
   unicode strings, that encoding is used for the names of CGI
   variables.

Perhaps explain where those ISO-8859-1 bytes might come from:

    ...are native strings. Where native strings are Unicode, any
    keys derived from byte-oriented sources (such as custom headers
    in the HTTP request reflected in the CGI environment variables)
    should be decoded using the ISO-8859-1 encoding.

3. For the CGI variables contained in the WSGI environment, the values
   of the variables are native strings. Where native strings are
   unicode strings, ISO-8859-1 encoding would be used such that the
   original character data is preserved and as necessary the unicode
   string can be converted back to bytes and thence decoded to unicode
   again using a different encoding.
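
To illustrate the round trip that point 3 describes, here is a minimal sketch for Python 3. The helper name redecode() and the sample path are made up for illustration; neither comes from any WSGI draft.

```python
def redecode(value, encoding="utf-8"):
    """Recover the original bytes from an ISO-8859-1-decoded native
    string, then decode them with the encoding the application expects."""
    # Lossless: every code point U+0000 to U+00FF maps back to one byte.
    raw = value.encode("iso-8859-1")
    return raw.decode(encoding)


# Simulate a gateway: the client sent UTF-8 bytes on the wire and the
# gateway decoded them as ISO-8859-1, as the draft requires.
wire_bytes = "/caf\u00e9".encode("utf-8")
environ = {"PATH_INFO": wire_bytes.decode("iso-8859-1")}

# The application undoes the ISO-8859-1 step and applies its own encoding.
path = redecode(environ["PATH_INFO"])
```

The intermediate string looks like mojibake, but no information is lost as long as the gateway really did start from the original bytes.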

Good. The only problem that remains with this is that in certain environments (notably: all IIS use, not just CGI) a WSGI gateway cannot fully comply with this requirement.

Should we:

a. disallow environments that cannot be sure they are preserving the original byte data from declaring that they support wsgi.version 1.1?

b. add an extra wsgi.something flag for a WSGI server to set, to specify that it is sure that the original bytes have been preserved? (i.e. wsgiref's CGI handler would have to declare it wasn't sure when running under Windows.)

c. just let WSGI gateways silently ignore the ISO-8859-1 requirement if they can't honour it and let the application spend its time trying to unravel the mess (status quo).
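
From the application side, option (b) might look something like the sketch below. 'wsgi.something' is only the placeholder name used above, not a real environ key, and the helper name path_info() and its fallback behaviour are assumptions.

```python
def path_info(environ, encoding="utf-8"):
    """Return PATH_INFO re-decoded from its original bytes if the gateway
    vouches for them; otherwise return the value as delivered."""
    value = environ.get("PATH_INFO", "")
    # 'wsgi.something' is a placeholder flag name, purely hypothetical.
    if environ.get("wsgi.something", False):
        # The gateway says the bytes survived, so the round trip is safe.
        return value.encode("iso-8859-1").decode(encoding)
    # No guarantee given: leave the string alone rather than guess.
    return value
```

A gateway that cannot make the promise would simply omit the flag, or set it to a false value.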

(Can wsgiref be fixed to use ISO-8859-1 in time for Python 3.2?)

7. The iterable returned by the application and from which response
   content is derived, should yield byte strings. Where native strings
   are unicode strings, the native string type can also be returned in
   which case it would be encoded as ISO-8859-1.

8. The value passed to the 'write()' callback returned by
   'start_response()' should be a byte string. Where native strings
   are unicode strings, a native string type can also be supplied, in
   which case it would be encoded as ISO-8859-1.

Weren't we going to only allow US-ASCII for the output? (These threads are always so far apart I can never remember what conclusion we reached... if any.)

Whilst ISO-8859-1 is in the HTTP standard for headers, and required to preserve bytes in input, it's much, much less likely that the response body is going to be ISO-8859-1. It could maybe be cp1252, but more likely the author wanted UTF-8.

If we must support Unicode strings for response body output at all, I'd prefer to be conservative here and spit a UnicodeEncodeError straight away, rather than quietly mangle characters U+0080 to U+00FF.
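
A small sketch of the difference, for Python 3. The helper name encode_chunk() is made up here and is not taken from wsgiref or any draft; it only shows what "fail straight away" could mean.

```python
def encode_chunk(chunk):
    """Pass bytes through untouched; encode native strings strictly as
    US-ASCII so that anything above U+007F fails loudly."""
    if isinstance(chunk, bytes):
        return chunk
    return chunk.encode("ascii")  # raises UnicodeEncodeError above U+007F


text = "caf\u00e9"

# The same string produces different bytes depending on the guess:
as_latin1 = text.encode("iso-8859-1")  # one byte for U+00E9
as_utf8 = text.encode("utf-8")         # two bytes for U+00E9

# The conservative helper refuses to guess.
try:
    encode_chunk(text)
    failed = False
except UnicodeEncodeError:
    failed = True
```

With the ISO-8859-1 rule, an author who meant UTF-8 gets the single byte 0xE9 on the wire and no error to point at the mistake.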

Manlio Perillo wrote:

The run_with_cgi sample function should be changed, since it probably
does not work correctly, on Python 3.x.

Yes, the 'URL Reconstruction' fragment will be wrong too, since it uses urllib.quote() (urllib.parse.quote() on Python 3) to encode the path part. On Python 3, quote() defaults to UTF-8 rather than the ISO-8859-1 that WSGI 1.1 requires.
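
The quoting mismatch is easy to show on Python 3, where the function lives at urllib.parse.quote() and takes an encoding argument. The sample path is an assumption chosen for illustration.

```python
from urllib.parse import quote

# A native string as a WSGI 1.1 gateway would deliver the single byte 0xE9.
path = "/caf\xe9"

# Default behaviour: the string is encoded as UTF-8 before quoting,
# so one original byte turns into two percent-escapes.
default_quoted = quote(path)

# Passing the encoding explicitly restores the original byte.
latin1_quoted = quote(path, encoding="iso-8859-1")
```

Only the second form reproduces the URL the client actually sent.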

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
