Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Alan Kennedy
[Alan]
 The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
 used simply as an identity encoding that also enforces that all
 bytes in the string have a value from 0x00 to 0xff, so that they are
 suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
 a byte-oriented protocol. The problem is that python-the-language
 didn't have support for bytes at the time WSGI was designed.

[Thomas]
 If you're talking about the output stream, then yes, it's all about
 bytes (or should be).

Indeed, I was only talking about output, specifically the response body.

 But at the status and headers level, HTTP/1.1 is
 fundamentally ISO-8859-1-encoded.

Agreed.

That is why the WSGI spec also states:


Note also that strings passed to start_response() as a status or as
response headers must follow RFC 2616 with respect to encoding. That
is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME
encoding.


So in order to use non-ISO-8859-1 characters in response status
strings or headers, you must use RFC 2047.

As confirmed by the links you posted, this is an HTTP restriction, not
a WSGI restriction.
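
As a rough illustration of what an RFC 2047 encoded value looks like, here is a minimal Python 3 sketch using the standard library's email.header module. The sample text is an assumption chosen for illustration, and nothing here implies that HTTP clients actually decode such values.

```python
from email.header import Header, decode_header

# Sample header text containing a non-ISO-8859-1 character (Greek gamma).
text = "interferon-gamma (IFN-\u03b3)"

# RFC 2047 wraps the encoded bytes in an ASCII-only "encoded-word".
encoded = Header(text, charset="utf-8").encode()

# The result is pure ASCII, so it is also valid ISO-8859-1.
encoded.encode("iso-8859-1")

# Decoding the encoded-word recovers the original text.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)
```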

Regards,

Alan.
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Alan Kennedy
[Phillip]
 WSGI already copes, actually.  Note that Jython and IronPython have
 this issue today, and see:

 http://www.python.org/dev/peps/pep-0333/#unicode-issues

[James]
 It would seem very odd, however, for WSGI/python3 to use strings-
 restricted-to-0xFF for network I/O while everywhere else in python3 is
 going to use bytes for the same purpose.

I think it's worth pointing out the reason for the current restriction
to iso-8859-1 is *because* python did not have a bytes type at the
time the WSGI spec was drawn up. IIRC, the bytes type had not yet even
been proposed for Py3K. CPython effectively held all byte sequences as
strings, a paradigm which is (still) followed by jython (not sure
about ironpython).

The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
used simply as an identity encoding that also enforces that all
bytes in the string have a value from 0x00 to 0xff, so that they are
suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
a byte-oriented protocol. The problem is that python-the-language
didn't have support for bytes at the time WSGI was designed.
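
The "identity encoding" property can be sketched in Python 3, where str and bytes are distinct types. The helper name is_byte_like is hypothetical and not part of WSGI; it simply applies the iso-8859-1 rule described above.

```python
# Every code point 0x00..0xFF maps to the byte with the same value.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in as_text]
round_tripped = as_text.encode("iso-8859-1")


def is_byte_like(s):
    """Return True if every character of s fits in a single byte."""
    try:
        s.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True
```

A string holding only such characters can be written to a byte-oriented stream without losing information, which is all the restriction is meant to guarantee.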

[James]
 You'd have to modify your app
 to call write(unicodetext.encode('utf-8').decode('latin-1')) or so

Did you mean: write(unicodetext.encode('utf-8').encode('latin-1'))?

Either way, the second encode is not required;
write(unicodetext.encode('utf-8')) is sufficient, since it will
generate a byte-sequence(string) which will (actually should: see
(*) note below) pass the following test.

try:
   wsgi_response_data.encode('iso-8859-1')
except UnicodeError:
   # Illegal WSGI response data!
   raise
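
For comparison, here is a sketch of the same idea under Python 3 semantics, where bytes objects have no encode method and the decode('latin-1') step is what turns the UTF-8 bytes into a byte-like string. The variable names follow the ones used above; the sample text is an assumption.

```python
unicodetext = "interferon-gamma (IFN-\u03b3) responses in cattle"

# Encode the text to the bytes that should go on the wire...
wire_bytes = unicodetext.encode("utf-8")

# ...then carry those bytes in a str, one character per byte.
wsgi_response_data = wire_bytes.decode("latin-1")

# The iso-8859-1 check passes and recovers exactly the UTF-8 bytes.
recovered = wsgi_response_data.encode("iso-8859-1")
```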

On a side note, Philip Jenvey's excellent rework of the jython IO
subsystem to use java.nio is fundamentally byte oriented.

http://www.nabble.com/fileno-support-is-not-in-jython.-Reason--t4750734.html
http://fisheye3.cenqua.com/browse/jython/trunk/jython/src/org/python/core/io

This is because it is based on the new IO design for Python 3K, as
described in PEP 3116:

http://www.python.org/dev/peps/pep-3116/

Regards,

Alan.

[*] Although I notice that cpython 2.5, for a reason I don't fully
understand, fails this particular encoding sequence. (Maybe it's to do
with the possibility that the result of an encode operation is no
longer an encodable string?)

Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8').encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position
22: ordinal not in range(128)


Meaning that to enforce the WSGI iso-8859-1 convention on cpython 2.5,
you would have to carry out this rigmarole

>>> response.encode('utf-8').decode('latin-1').encode('latin-1')
'interferon-gamma (IFN-\xce\xb3) responses in cattle'


Perhaps this behaviour is an artifact of the cpython implementation?
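
One likely explanation, offered as a sketch: in CPython 2.x, calling encode() on a byte string first decodes it to unicode with the default codec (ASCII), and that implicit step is the one that fails. The Python 3 snippet below reproduces the same error by making the hidden decode explicit.

```python
utf8_bytes = "interferon-gamma (IFN-\u03b3) responses in cattle".encode("utf-8")

try:
    # What Python 2's str.encode('latin-1') did implicitly as a first step.
    utf8_bytes.decode("ascii").encode("latin-1")
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The captured exception names the ascii codec and the first non-ASCII byte, matching the traceback shown for cpython 2.5.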

Whereas jython passes it just fine (and correctly, IMHO)

Jython 2.2.1 on java1.4.2_15
Type "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
>>> response.encode('utf-8').encode('latin-1')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Ian Bicking
Phillip J. Eby wrote:
 So here are my recommendations so far for the addendum to WSGI *1.0* for 
 Python 3.0 (I expect we can be more strict for WSGI 2.0):
 
 * When running under Python 3, applications SHOULD produce bytes output 
 and headers
 
 * When running under Python 3, servers and gateways MUST accept strings 
 as application output or headers, under the existing rules (i.e., 
 s.encode('latin-1') must convert the string to bytes without an exception)
 
 * When running under Python 3, servers MUST provide CGI HTTP variables 
 as strings, decoded from the headers using HTTP standard encodings (i.e. 
 latin-1 + RFC 2047)  (Open question: are there any CGI or WSGI variables 
 that should NOT be strings?)

I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1.
That is, after you urldecode the values (as WSGI asks you to do), the
proper conversion to text is to decode them as UTF8.
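
A minimal Python 3 sketch of that two-step conversion, assuming the path really was UTF-8 encoded before being percent-encoded. The helper name decode_path and the sample path are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path, charset="utf-8"):
    """URL-decode to bytes, then decode to text with the given charset."""
    return unquote_to_bytes(raw_path).decode(charset)


# "/caf%C3%A9" is "/café" with the "é" sent as UTF-8 bytes.
path_info = decode_path("/caf%C3%A9")
```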

I'm a bit confused on how HTTP_COOKIE gets encoded.  And QUERY_STRING 
also confuses me.

Is this all compatible with os.environ in py3k?  I don't care that much 
if it does, but as the starting point for CGI it would be interesting if 
it stays in sync.

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Phillip J. Eby
So here are my recommendations so far for the addendum to WSGI *1.0* 
for Python 3.0 (I expect we can be more strict for WSGI 2.0):

* When running under Python 3, applications SHOULD produce bytes 
output and headers

* When running under Python 3, servers and gateways MUST accept 
strings as application output or headers, under the existing rules 
(i.e., s.encode('latin-1') must convert the string to bytes without 
an exception)

* When running under Python 3, servers MUST provide CGI HTTP 
variables as strings, decoded from the headers using HTTP standard 
encodings (i.e. latin-1 + RFC 2047)  (Open question: are there any 
CGI or WSGI variables that should NOT be strings?)

* When running under Python 3, servers MUST make wsgi.input a binary 
(byte) stream

* When running under Python 3, servers MUST provide a text stream for 
wsgi.errors

These rules are intended to simplify the porting of existing 
code.  Notice, for example, that these rules allow middleware to pass 
strings through unchanged, since they are not required to produce 
bytes output or headers.
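
A minimal sketch of how a server or gateway might apply the first two rules to application output or header values. The helper name to_wire is hypothetical and not part of any WSGI specification.

```python
def to_wire(value):
    """Coerce application output or a header value to bytes.

    Hypothetical helper: bytes pass through untouched, and str is
    accepted only if it encodes as latin-1 without an exception.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # Raises UnicodeEncodeError for characters above 0xFF.
        return value.encode("latin-1")
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```

Because strings are still accepted, middleware that only passes values along needs no change; the conversion happens once, at the server boundary.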

Unfortunately, wsgi.input can't be coded around, but for most 
frameworks this should be a single point of pain.  In fact, if the 
'cgi' stdlib module is made compatible with bytes, only the rare 
framework that rolls its own multipart parser or otherwise directly 
manipulates put/post data will be affected.  Code that just takes the 
input and writes it to a file won't be bothered, either.
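
To make the stream rules concrete, here is a sketch of an environ in which wsgi.input is a byte stream and wsgi.errors is a text stream. The environ contents and the helper read_body are assumptions for illustration; a real server would supply its own stream objects.

```python
import io


def read_body(environ):
    """Read the request body as bytes from wsgi.input."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return environ["wsgi.input"].read(length)


# Hypothetical environ, as a server following these rules might build it.
environ = {
    "CONTENT_LENGTH": "8",
    "wsgi.input": io.BytesIO(b"g=%CE%B3"),  # binary (byte) stream
    "wsgi.errors": io.StringIO(),           # text stream
}

body = read_body(environ)
environ["wsgi.errors"].write("read %d bytes\n" % len(body))
```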

Comments or questions?



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread James Y Knight

On Dec 7, 2007, at 5:46 PM, Andrew Clover wrote:
 OTOH making the dictionaries reflect the underlying OS's conception of
 environment variables means users of os.environ and WSGI will have to be
 able to cope with both bytes and unicode, which would also be a big
 annoyance.

 In summary: urgh, this is all messy and 'orrible.

I suppose this is more a question for python-dev, but, it'd be really  
nice if Python on Windows made it look like the windows system  
encoding was always UTF-8. That is, bytestrings used for open/ 
os.environ/argv/etc. are always encoded/decoded in utf-8, not the  
broken-platform-encoding. Then the same code would work just as well  
on unix as it does on windows.

Actually, I bet I could implement that today, just by wrapping some  
stuff... hmmm...

James


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Andrew Clover
James Y Knight wrote:

 In addition, I know of nobody who actually implements RFC 2047
 decoding of http header values...nothing really uses it. (of
 course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI. 
Most browsers, when quoting a header parameter, simply encode using the 
previous page's charset and put quotes around it... even if the 
parameter has a quote or control codes in it.

Ian wrote:

  Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This 
is correct for Windows where environment variables are explicitly 
Unicode, but questionable (IMO) for Unix where they're really bytes that 
may or may not represent decodeable Unicode strings.

 SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these 
are passed in environment variables, IIS* has to decode the submitted 
bytes to Unicode first. It seems always to choose UTF-8 for this job, 
which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, os.environ being byte strings, Python/the C library then 
has to encode them back to bytes, which I believe ends up using the 
system codepage. Since the system codepage is never UTF-8 on Windows 
this means not only that the bytes read back from eg. PATH_INFO are not 
the same as the original bytes submitted to the web server, but that if 
there are characters outside the system codepage submitted, they'll be 
unrecoverable.
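
The loss can be sketched in Python 3 as follows. Here cp1252 stands in for a Windows system codepage and errors="replace" stands in for whatever substitution the C library performs; both are assumptions for illustration, and the real conversion may substitute differently.

```python
# Text as the web server decoded it from the request (contains Greek gamma).
path_info = "/IFN-\u03b3"

# cp1252 has no Greek letters, so the gamma cannot be represented.
as_codepage = path_info.encode("cp1252", errors="replace")

# The bytes originally submitted, assuming the client sent UTF-8.
original = path_info.encode("utf-8")
```

Once the character has been replaced, nothing in as_codepage allows the application to reconstruct the original bytes.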

If os.environ remains Unicode in Unix and WSGI follows it (as it must if 
CGI-invoked WSGI is to continue working smoothly), webapps that try to 
allow for non-ASCII characters in URLs are likely to get some nasty 
deployment problems that depend on the system encoding setting, 
something that will be particularly troublesome for end-users to debug 
and fix.

OTOH making the dictionaries reflect the underlying OS's conception of 
environment variables means users of os.environ and WSGI will have to be 
able to cope with both bytes and unicode, which would also be a big 
annoyance.

In summary: urgh, this is all messy and 'orrible.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Andrew Clover
Adam Atlas [EMAIL PROTECTED] wrote:

 I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and 
unicode characters should be done at a higher level, by whatever web 
framework is living above WSGI.
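
A sketch of that division of labour for the response body: the layer above WSGI chooses a charset and encodes, and only bytes cross the WSGI boundary. The app and the stand-in start_response are hypothetical; headers are left as plain ASCII strings here, since how they should be typed was still under discussion.

```python
def app(environ, start_response):
    """Hypothetical app: the framework layer picks the charset and encodes."""
    text = "IFN-\u03b3"
    body = text.encode("utf-8")  # the only place text becomes bytes
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Minimal stand-in for a server's start_response, for demonstration only.
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers


chunks = app({}, start_response)
```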

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread James Y Knight

On Dec 7, 2007, at 2:55 PM, Phillip J. Eby wrote:

 * When running under Python 3, servers MUST provide CGI HTTP
 variables as strings, decoded from the headers using HTTP standard
 encodings (i.e. latin-1 + RFC 2047)  (Open question: are there any
 CGI or WSGI variables that should NOT be strings?)

A WSGI gateway should *not* decode headers using RFC 2047. It actually  
*cannot*, without knowing the structure of that particular header,  
because only TEXT tokens are encoded that way. In addition, I know of  
nobody who actually implements RFC 2047 decoding of http header  
values...nothing really uses it. (of course I don't know of all  
implementations out there.)


On Dec 7, 2007, at 3:24 PM, Ian Bicking wrote:

 I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1.
 That is, after you urldecode the values (as WSGI asks you to do)
 proper conversion to text is to decode it as UTF8.

Surely not! URLs aren't always utf-8 encoded, only often.
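
A short Python 3 sketch of the failure case: a path whose non-ASCII character was percent-encoded from ISO-8859-1 rather than UTF-8. The sample path is an assumption for illustration.

```python
from urllib.parse import unquote_to_bytes

# "é" percent-encoded as a single ISO-8859-1 byte, not as UTF-8.
raw = unquote_to_bytes("/caf%E9")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 never fails, but may yield the wrong text for other encodings.
as_latin1 = raw.decode("iso-8859-1")
```

A gateway that unconditionally decodes as UTF-8 would therefore reject, or mangle, requests like this one.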

James

