Graham Dumpleton wrote:
> Ian, know you have seen this before, but didn't realise you hadn't
> cc'd the list. I have added a new response to part 4 of what you
> originally sent that wasn't in first reply that went direct to you.
>
> 2009/8/4 Ian Bicking <i...@colorstudy.com>:
>> On Mon, Aug 3, 2009 at 7:38 PM, Graham
>> Dumpleton<graham.dumple...@gmail.com> wrote:
>>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>>> what I was going to implement.
>>>
>>> 1. When running under Python 3, applications SHOULD produce bytes
>>> output, status line and headers.
>> Sure.
>>
>>> This is effectively what we had before. The only difference is that
>>> clarify that the 'status line' values should also be bytes. This
>>> wasn't noted before. I had already updated the proposed WSGI 1.0
>>> amendments page to mention this.
>>>
>>> 2. When running under Python 3, servers and gateways MUST accept
>>> strings for output, status line and headers. Such strings must be
>>> converted to bytes output using 'latin-1'. If string cannot be
>>> converted then is treated as an error.
>>>
>>> This is again what we had before except that mention 'status line' value.
>> Sure. ASCII for the status would be acceptable, as I believe that is
>> an HTTP constraint.
>>
>>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>>> binary (byte) input stream.
>>>
>>> No change here.
>> Yep.
>>
>>> 4. When running under Python 3, servers MUST provide a text stream for
>>> wsgi.errors. In converting this to a byte stream for writing to a
>>> file, the default encoding would be applied.
>>>
>>> No real change here except to clarify that default encoding would
>>> apply. Use of default encoding though could be problematic if
>>> combining different WSGI components. This is because each WSGI
>>> component may have been developed on system with different default
>>> encoding and so one may expect to log characters that can't be written
>>> on a different setup.
>>> Not sure how you could solve that except to say
>>> people have default encoding be UTF-8 for portability.
>> Sure. We might specify that the server should never give an encoding
>> error; it should use 'replace' or something to make sure it won't
>> fail. Maybe it should be specified what should happen when bytes are
>> received. I generally believe that error handling code should try to
>> be as robust as possible, so it shouldn't fail regardless of what it
>> is given.
>
> Not that it matters, but looks like that for Apache/mod_wsgi
> wsgi.errors should be an instance of io.TextIOWrapper wrapping
> internal mod_wsgi specific buffer object providing interface
> compatible with io.BufferedIOBase. If someone uses write() on wrapper
> with bytes it will fail:
>
> TypeError: write() argument 1 must be str, not bytes
>
> If someone uses print() to output data, then bytes would be converted
> okay. That is:
>
> print(b'1234', file=environ['wsgi.errors'])
>
> yields:
>
> b'1234'.
>
> If 'replace' is used for errors, you do end up with data loss. Use of
> 'xmlcharrefreplace' at least preserves values as numbers, although for
> Apache at least, if use 'ascii' encoding, you get a bit of a mess as
> the backslashes get escaped again.
>
> \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10
>
> instead of original:
>
> \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10
>
> That is because Apache logging functions escape anything which isn't
> printable ASCII and in turn escapes backslash denoting escaped
> character.
>
> If use encoding of utf-8 instead, then byte values get passed and
> Apache logging functions then just escape the non printable bytes
> instead so all up looks nicer.
>
> \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
> \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90
>
> So for Apache/mod_wsgi at least, best thing to do seems to use
> 'replace' and 'utf-8' due to way that Apache error logging functions
> work.
>
> I guess the point from this is that possibly should specify that
> wsgi.errors should be an instance of io.TextIOWrapper. A specific
> implementation should not use 'strict', but use 'replace' or
> 'backslashreplace' as makes sense, dependent on what encoding it needs
> to use and how any underlying logging system it overlays works. The
> intent overall being to preserve as much of raw information as
> possible.
>
>>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>>> server variables as strings. Where such values are sourced from a byte
>>> string, be that a Python byte string or C string, they should be
>>> converted as 'UTF-8'. If a specific web server infrastructure is able
>>> to support different encodings, then the WSGI adapter MAY provide a
>>> way for a user of the WSGI adapter to customise on a global basis, or
>>> on a per value basis what encoding is used, but this is entirely
>>> optional. Note that there is no requirement to deal with RFC 2047.
>> Ugh. This is where I'm not happy with how WSGI 1 in Python 3 has been
>> treated. I think it should be bytes, just like it is in Python 2.
>
> I still don't understand what is the practical, vs theoretical use
> case for that in Python 3. In Python 2 byte strings work out okay
> because URL routing rules through whatever means are generally also
> going to be defined in terms of byte strings.
> In Python 3 however,
> routing is going to likely default to being defined with strings and
> as such, any information like SCRIPT_NAME, PATH_INFO and QUERY_STRING
> is going to have to almost immediately be converted to strings from
> bytes to apply routing rules anyway.
>
> Can you expand on what benefits come from and what practical use case
> would predominate that would mean that bytes would be the better
> option?
>
>> But if we have an encoding, I guess UTF8 is okay so long as it uses
>> PEP 383: http://www.python.org/dev/peps/pep-0383/ -- for the most part
>> PEP 383, and putting the encoding that was used into the environment,
>> makes transcoding doable. PEP 383 doesn't allow for transcoding
>> unless you keep track of the encoding used, so we have to store that
>> in the environment.
>
> Again, what practical use cases are there where transcoding would be
> necessary, especially if it was a requirement that the WSGI
> adapter/server at lowest level, if it makes sense for that server
> infrastructure, ie., can support something other than UTF-8, to
> provide an option to supply WSGI environ values, all or selected,
> interpreted as a different encoding?
>
> If the option is at the WSGI adapter/server level and managed at the
> point of original translation from bytes, then a WSGI application or
> middleware doesn't need to worry about it. As such, noting what
> encoding was used in the environment serves no purpose except for
> information purposes. Marking what encoding was used also would not
> necessarily be straightforward if the WSGI adapter/server provided a
> way of overriding encoding used for specific values, because one value
> for encoding indicator would not suffice.
>
> To allow experimentation with encoding of values, current mod_wsgi
> code allowed overriding of values on global or individual basis.
> This was done via an Apache directive, but as had to pass this
> information from main Apache worker process to mod_wsgi daemon
> process, did it in such a way that also visible to application for
> information purposes at this point. Was using convention as follows.
>
> # Override encoding for everything to UTF-8.
> mod_wsgi.variable_encoding: UTF-8
>
> # Override encoding and pass raw bytes for everything.
> mod_wsgi.variable_encoding: -
>
> # Override encoding of specific value to UTF-8.
> mod_wsgi.variable_encoding.SCRIPT_NAME: UTF-8
>
> # Override encoding and pass raw bytes for specific value.
> mod_wsgi.variable_encoding.SCRIPT_NAME: -
>
> If default encoding used for everything, then no value passed at all.
>
> In respect of passing bytes for values, we get back to argument from
> past discussions as to what should be passed as bytes. Do you only do
> SCRIPT_NAME, PATH_INFO and QUERY_STRING? What about server specific
> variables such as REQUEST_URI? What about headers such as Referrer?
> What about custom user values set using something like SetEnv
> directive in Apache?
>
> This is where it started to turn into a can of worms last time. You
> either treat everything as UTF-8 to be consistent, or use bytes for
> everything, in which case a great deal more work is put onto WSGI
> applications even for potentially simple stuff, effectively forcing
> the use of high level request wrappers like WebOb or request object in
> Werkzeug.
>
> In summary, what are the practical use cases that would make passing
> bytes over UTF-8 or even latin-1 worthwhile?
>
> If passing bytes, what values should be passed as bytes and what left alone?
>
> What practical use cases are there that would necessitate transcoding?
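
To make the transcoding question concrete, here is a minimal sketch of
what an application would have to do, not anything the WSGI spec or
mod_wsgi defines. It assumes the server decoded the raw bytes with the
PEP 383 'surrogateescape' handler and that the experimental
mod_wsgi.variable_encoding keys quoted above are visible in the environ.
encoding_for() and transcode() are hypothetical helper names, and
falling back to UTF-8 when no key is present is an assumption based on
point 5.

```python
def encoding_for(environ, name, default="utf-8"):
    """Return the encoding recorded for one variable (hypothetical helper).

    A per-variable key wins over the global key; with neither present
    the default is assumed. The value "-" means raw bytes were passed.
    """
    specific = environ.get("mod_wsgi.variable_encoding." + name)
    if specific is not None:
        return specific
    return environ.get("mod_wsgi.variable_encoding", default)


def transcode(environ, name, wanted):
    """Reinterpret one environ value in the encoding the application wants."""
    value = environ[name]
    if isinstance(value, bytes):
        # Raw bytes were passed through untouched, so just decode them.
        raw = value
    else:
        # Undo the server's decode; surrogateescape restores the original
        # bytes, but only if we know which encoding the server used.
        raw = value.encode(encoding_for(environ, name), "surrogateescape")
    return raw.decode(wanted)


# A latin-1 path that the server decoded as UTF-8 with surrogateescape.
environ = {"PATH_INFO": b"/caf\xe9".decode("utf-8", "surrogateescape")}
print(repr(environ["PATH_INFO"]))
print(repr(transcode(environ, "PATH_INFO", "latin-1")))
```

The round trip only works because the encoding used for the first
decode is known, which is the reason given for recording it in the
environ.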

It's probably harder for newbies to understand transcoding, and
converting bytes to strings and vice versa. I think that counts as a
practical use case, in that high-level frameworks would have to do some
wrapping around it, thus potentially making the WSGI spec significantly
harder to implement in derivative works.

Thus, I'd not recommend making WSGI 2 more obfuscated than necessary,
unless that is supported by real-world scenarios, as Graham suggested.

Hoping not to have added too much fuel to the fire.. ;)

Etienne

-- 
Etienne Robillard <robillard.etie...@gmail.com>
Green Tea Hackers Club <http://gthc.org/>
Blog: <http://gthc.org/blog/>
PGP Fingerprint: AED6 B33B B41D 5F4F A92A 2B71 874C FB27 F3A9 BDCC
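
For the wsgi.errors behaviour discussed under point 4 of the quoted
message, a small sketch using only the standard io module may help. An
io.BytesIO stands in for the mod_wsgi specific buffer object (an
assumption; the real object is internal to mod_wsgi), and logged_bytes()
is a hypothetical helper showing which bytes reach the underlying log
for a given encoding and error handler. Apache's own escaping of the log
line is not modelled.

```python
import io


def logged_bytes(text, encoding, errors):
    """Return the bytes a TextIOWrapper hands to its underlying buffer."""
    raw = io.BytesIO()  # stand-in for the server's internal log buffer
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


thai = "\u0e40\u0e2d\u0e01"  # first three characters of the quoted example

print(logged_bytes(thai, "ascii", "replace"))            # b'???'
print(logged_bytes(thai, "ascii", "backslashreplace"))   # b'\\u0e40\\u0e2d\\u0e01'
print(logged_bytes(thai, "ascii", "xmlcharrefreplace"))  # b'&#3648;&#3629;&#3585;'
print(logged_bytes(thai, "utf-8", "replace"))            # b'\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81'

# write() rejects bytes on a text stream, while print() logs their repr().
raw = io.BytesIO()
errors_stream = io.TextIOWrapper(
    raw, encoding="utf-8", errors="backslashreplace", newline="\n"
)
try:
    errors_stream.write(b"1234")
except TypeError as exc:
    print("write() rejected bytes:", exc)
print(b"1234", file=errors_stream)
errors_stream.flush()
print(raw.getvalue())  # b"b'1234'\n"
```

Note that the \uXXXX escapes shown in the quoted message are what
Python's 'backslashreplace' handler produces; 'xmlcharrefreplace'
produces &#NNNN; references instead.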