Henry Precheur wrote: > On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote: > > It looks simpler until you have a site that is not primarily utf-8. > > In that case, you multiply your (1 line * number of middlewares in the > > WSGI > > stack * each request). > > With wsgi.uri_encoding you get either (1 line * 1 > > middleware designed to transcode * each request), or even 0 if your > > whole site uses just one charset. > > I am not sure I understand your point. > > The 0 lines hold true if the whole site is using latin-1 or utf-8 and > you write your applications/middlewares only for this site. But if it's > using any other encoding you still have to transcode. > > def middleware(start_response, environ): > value = environ['some_key'].\ > encode('utf8', 'surrogateescape').\ > decode(SITE_ENCODING) > ...
Yes; you have to transcode to the "correct" encoding. Once. Then every other WSGI application interface "below" that one doesn't have to care. > With wsgi.uri_encoding you would still have to do the following: > > def middleware(start_response, environ): > value = environ['some_key'].\ > encode(environ['some_key.encoding']).\ > decode(SITE_ENCODING) > ... > > Of course you can directly use `environ['some_key']` if you know you'll > get the 'right' encoding all the time. But when the encoding changes, > you'll have to fix all your middlewares. The decoding doesn't change spontaneously. You either get the correct one or you get an incorrect one. If it's incorrect, you fix it, one time, via a WSGI component which you've configured to determine the "correct" decoding. Then every other WSGI component "below" that one can go back to trusting the decoding was correct. In fact, if you do that transcoding right away, no other WSGI components need to be rewritten to take advantage of unicode. You just have to deploy a single transcoder, that's 6 lines of code max. I know PJE will chime in here and say you can't deploy a website that works differently if you happen to forget to turn on a given piece of middleware, but I also know the rest of you will drown him out from personal experience because you've *never* done that. ;) With utf8+surrogateescape, you don't transcode once, you transcode in every WSGI component in your stack that needs to "correct" the decoding. You have to do it more than once because, each time you encode/re-decode, you use the result and then throw it away. Any subsequent WSGI components have to encode/re-decode--you cannot store the redecoded URI in SCRIPT_NAME/PATH_INFO, because the utf8+surrogateescape scheme says...well, it's always utf8-decoded. In addition, *every* component that needs to compare URI's then has to be configured with the same logic, however convoluted, to perform the "correct" decoding again. It's not just routing middleware: caches need to reliably compare decoded URI's; so do sessions; so does auth (especially!); so do static files. And Heaven forfend you actually decode differently in two different components! Robert Brewer fuman...@aminus.org _______________________________________________ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com