On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
> The first issue is that there doesn't seem to be a way to parse
> x-www-form-urlencoded query strings in a character set other than
> UTF-8, for example:
>
>   'premier=un&deuxi%E8me=deux'  # latin-1
>
> The urllib.parse.unquote* functions take encoding and errors
> parameters, but none of the higher-level ones. The solution to me
> seems to be that functions that build on top of
> it--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
> constructor--should grow encoding and errors parameters that they pass
> through to the lower-level functions.
>
> The second issue is that the FieldStorage classes work with text input
> streams. However, with multipart/form-data posts, posted files aren't
> necessarily in the same encoding as form fields, or may be binary and
> not text at all. I would suggest that FieldStorage should be changed
> to take a binary input stream.
>
> [...]

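For concreteness, here is a minimal sketch of the workaround the first issue currently forces, assuming only that urllib.parse.unquote accepts the encoding and errors parameters mentioned above. The helper name parse_query is hypothetical and is not part of the standard library; it splits the query string by hand so that the encoding can be passed down to unquote.

```python
from urllib.parse import unquote


def parse_query(qs, encoding="utf-8", errors="replace"):
    """Hypothetical helper: parse an x-www-form-urlencoded string
    in a caller-chosen character set."""
    pairs = []
    for field in qs.split("&"):
        if not field:
            continue  # skip empty fields such as a trailing '&'
        name, _, value = field.partition("=")
        # '+' encodes a space in form data; translate it before unquoting
        # so that a literal '%2B' still decodes to '+'.
        name = unquote(name.replace("+", " "), encoding=encoding, errors=errors)
        value = unquote(value.replace("+", " "), encoding=encoding, errors=errors)
        pairs.append((name, value))
    return pairs


print(parse_query("premier=un&deuxi%E8me=deux", encoding="latin-1"))
# [('premier', 'un'), ('deuxième', 'deux')]
```

With the default of UTF-8 and errors="replace", the lone %E8 byte is not valid UTF-8 and comes back as U+FFFD, which is the behaviour the higher-level functions are stuck with if they cannot be told the encoding.
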
I'm not quite sure how to interpret the lack of response I've gotten on
this topic. Is it just that there's little interest in the cgi module?
Should I raise this issue on the python-dev list, or just open a bug
report and start submitting patches?

There's been a lot of discussion recently about bytes vs. str in email
headers and WSGI environ variables, but I haven't been able to find a
substantive discussion on this specific topic. Here are some of the
related quotes I've come across.

Martin v. Löwis wrote [1]:
> In a CGI application, you shouldn't be using sys.stdin or print().
> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
> if you use TextIO, there likely will be bugs (e.g. if you have
> attachments of type application/octet-stream).

bobince wrote [2]:
> Evan Fosmark wrote:
>> bobince wrote:
>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>> that hasn't been fixed properly for the new string model. It should
>>> be converting the incoming byte stream to characters before
>>> passing them to urllib.
>>>
>>> Did I mention Python 3.0's libraries (especially web-related
>>> ones) still being rather shonky? :-)
>>
>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>> wsgiref. I hope they get fixed soon. :(
>
> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
> seems to want ownership of the issue. Very disappointing.

There's also this bug report [3], but it doesn't directly propose the
changes that I have.

So: does anyone agree, or disagree, that cgi.FieldStorage should be
changed to take byte streams, and that many of the cgi and urllib.parse
functions should become encoding-aware, preferably in time for Python
3.1? The byte-stream change will break compatibility with Python 3.0,
but I strongly feel that treating POST data as text is wrong and should
not continue to be supported.

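As a rough illustration of the byte-stream approach, the sketch below reads a request body as bytes and leaves any decoding to the caller. The helper name read_post_body is hypothetical, and the sketch assumes the usual CGI convention that the CONTENT_LENGTH environment variable gives the body size. In a real CGI script the arguments would be sys.stdin.buffer and os.environ; an in-memory stream stands in for them here.

```python
import io


def read_post_body(stream, environ):
    """Hypothetical helper: read the raw POST body from a binary stream."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0  # treat a malformed header as "no body"
    # Read exactly the declared number of bytes; never decode here,
    # because the payload may not be text at all.
    return stream.read(length) if length > 0 else b""


# Stand-in for sys.stdin.buffer: eight bytes that are not valid text.
fake_stdin = io.BytesIO(b"\x89PNG\r\n\x1a\n")
body = read_post_body(fake_stdin, {"CONTENT_LENGTH": "8"})
print(type(body).__name__, len(body))
# bytes 8
```

Form fields can then be decoded with whatever encoding the request declares, while uploaded files stay as bytes.
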
-Miles Kaufmann

[1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
[2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
[3]: http://bugs.python.org/issue4953