Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-10 Thread Guido van Rossum
On Dec 9, 2007 7:56 PM, Graham Dumpleton [EMAIL PROTECTED] wrote:
 On 09/12/2007, Guido van Rossum [EMAIL PROTECTED] wrote:
  On Dec 8, 2007 12:37 AM, Graham Dumpleton [EMAIL PROTECTED] wrote:
   On 08/12/2007, Phillip J. Eby [EMAIL PROTECTED] wrote:
* When running under Python 3, servers MUST provide a text stream for
wsgi.errors
  
   In Python 3, what happens if user code attempts to write a byte
   string to a text stream? I.e., what would be displayed?
 
  Nothing. You get a TypeError.

 Hmmm, this in itself could be quite a pain for existing code where
 people have added debug code to print out details from request headers
 (if these are now to be passed as bytes), or part of the request content.

Sorry, I was just talking about the write() method on a text stream.
The print() function in 3.0 will print the repr() of the bytes.
Example:

Python 3.0a2 (py3k, Dec 10 2007, 09:38:42)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b"xyz"
>>> print(a)
b'xyz'
>>> b = b"abc\377def"
>>> print(b)
b'abc\xffdef'


(Note that this works because print() always calls str() on the
argument and bytes.__str__ is defined to be the same as bytes.__repr__.)
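
The difference between write() and print() described above can be checked with a short sketch. It assumes a current Python 3 interpreter and uses io.StringIO as a stand-in for a text stream such as wsgi.errors; the helper names are illustrative only.

```python
import io


def write_to_text_stream(data):
    """Call write() on a text stream and report what happened."""
    stream = io.StringIO()  # stand-in for a text stream such as wsgi.errors
    try:
        stream.write(data)
    except TypeError:
        return "TypeError"
    return stream.getvalue()


def print_to_text_stream(data):
    """print() calls str() on its argument, so bytes appear as their repr()."""
    stream = io.StringIO()
    print(data, file=stream)
    return stream.getvalue()


print(write_to_text_stream(b"xyz"))
print(print_to_text_stream(b"abc\377def"), end="")
```

So debug code that uses print() keeps working with bytes, while code that calls write() directly on the stream has to decode (or repr()) first.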

 What is the suggested way of dumping out bytes for debugging purposes
 so that one does not have to worry about encoding issues? Just use
 repr()?

Just use print().

   Also, if wsgi.errors is a text stream, I presume that a WSGI adapter
   which has to map this internally to a C char*-like API for logging
   would need to apply a standard Python encoding to yield a usable
   char* string for output.
 
  The encoding can/must be specified per text stream.

 But what should the encoding associated with the wsgi.errors stream be?

Depends on the platform and your requirements.

 If code which outputs text to wsgi.errors can use any valid Unicode
 character and one sets the stream to US-ASCII encoding, then there is a
 chance that logging output will fail because of characters not being
 valid in that character set. If one instead uses UTF-8, then there are
 potential issues where the byte string coming out of the other end of
 the text stream is passed to C API functions. Issues might arise here
 where the C API is not expecting a variable-width character encoding.

 I'll freely admit I am not across all this Unicode encode/decode stuff
 as I don't generally have to deal with foreign languages, but there
 seem to be a few missing details in this area which need to be filled
 out for a modified WSGI specification.

The goal of this part of Py3k is to make it more obvious when you
haven't thought through your encoding issues enough by failing as soon
as (encoded) bytes meet (decoded) characters.

Of course, you can still run into delayed trouble by using an
inappropriate encoding, which only shows up when there is an actual
encoding or decoding error; but at least you will have carefully
distinguished between encoded and decoded text throughout your
program, so the fix is now to change the encoding rather than having
to restructure your code to properly separate encoded and decoded
text.
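
A minimal sketch of the trade-off discussed above, assuming a current Python 3 and using io.TextIOWrapper over io.BytesIO as a stand-in for an error stream that feeds a byte-oriented log sink. The function name and sample message are made up for illustration, and errors="backslashreplace" is just one possible policy, not something proposed in this thread.

```python
import io


def log_line(text, encoding, errors="strict"):
    """Write text through a text stream and return the bytes it produced."""
    raw = io.BytesIO()  # stands in for the byte-oriented log sink
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


message = "IFN-\u03b3 response"

# UTF-8 can represent every character, at the cost of multi-byte sequences.
utf8_bytes = log_line(message, "utf-8")

# A strict ASCII stream fails as soon as a non-ASCII character is logged.
try:
    log_line(message, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An ASCII stream with an error handler never fails and stays single-byte.
escaped = log_line(message, "ascii", errors="backslashreplace")
```

The failure only shows up when a non-ASCII character is actually logged, which is the delayed trouble mentioned above; changing the encoding or error handler is then a one-line fix.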

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-08 Thread Graham Dumpleton
On 08/12/2007, Phillip J. Eby [EMAIL PROTECTED] wrote:
 * When running under Python 3, servers MUST provide a text stream for
 wsgi.errors

In Python 3, what happens if user code attempts to write a byte
string to a text stream? I.e., what would be displayed?

Also, if wsgi.errors is a text stream, I presume that a WSGI adapter
which has to map this internally to a C char*-like API for logging
would need to apply a standard Python encoding to yield a usable
char* string for output.

Graham


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-08 Thread Guido van Rossum
On Dec 8, 2007 12:37 AM, Graham Dumpleton [EMAIL PROTECTED] wrote:
 On 08/12/2007, Phillip J. Eby [EMAIL PROTECTED] wrote:
  * When running under Python 3, servers MUST provide a text stream for
  wsgi.errors

 In Python 3, what happens if user code attempts to write a byte
 string to a text stream? I.e., what would be displayed?

Nothing. You get a TypeError.

 Also, if wsgi.errors is a text stream, I presume that a WSGI adapter
 which has to map this internally to a C char*-like API for logging
 would need to apply a standard Python encoding to yield a usable
 char* string for output.

The encoding can/must be specified per text stream.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Alan Kennedy
[Alan]
 The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
 used simply as an identity encoding that also enforces that all
 bytes in the string have a value from 0x00 to 0xff, so that they are
 suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
 a byte-oriented protocol. The problem is that python-the-language
 didn't have support for bytes at the time WSGI was designed.

[Thomas]
 If you're talking about the output stream, then yes, it's all about
 bytes (or should be).

Indeed, I was only talking about output, specifically the response body.

 But at the status and headers level, HTTP/1.1 is
 fundamentally ISO-8859-1-encoded.

Agreed.

That is why the WSGI spec also states:


Note also that strings passed to start_response() as a status or as
response headers must follow RFC 2616 with respect to encoding. That
is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME
encoding.


So in order to use non-ISO-8859-1 characters in response status
strings or headers, you must use RFC 2047.

As confirmed by the links you posted, this is an HTTP restriction, not
a WSGI restriction.
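
As a rough illustration of the rule quoted above, the sketch below checks whether a header value fits in ISO-8859-1 and, if not, wraps it in an RFC 2047 encoded-word. It assumes Python 3 and borrows the stdlib email.header module, which is designed for mail headers; applying it to HTTP headers is an assumption made here for illustration, and whether any HTTP client would decode the result is a separate question. The helper names are hypothetical.

```python
from email.header import Header, decode_header


def is_latin1(value):
    """True if every character of value fits in ISO-8859-1."""
    try:
        value.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


def rfc2047_encode(value, charset="utf-8"):
    """Wrap value in an RFC 2047 encoded-word (pure ASCII result)."""
    return Header(value, charset).encode()


raw_value = "IFN-\u03b3"                    # contains a non-Latin-1 character
encoded_value = rfc2047_encode(raw_value)   # safe to pass as a header string

# Round trip: decode_header returns (bytes, charset) pairs.
decoded_bytes, decoded_charset = decode_header(encoded_value)[0]
```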

Regards,

Alan.


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Alan Kennedy
[Phillip]
 WSGI already copes, actually.  Note that Jython and IronPython have
 this issue today, and see:

 http://www.python.org/dev/peps/pep-0333/#unicode-issues

[James]
 It would seem very odd, however, for WSGI/python3 to use strings-
 restricted-to-0xFF for network I/O while everywhere else in python3 is
 going to use bytes for the same purpose.

I think it's worth pointing out that the reason for the current restriction
to iso-8859-1 is *because* Python did not have a bytes type at the
time the WSGI spec was drawn up. IIRC, the bytes type had not yet even
been proposed for Py3K. CPython effectively held all byte sequences as
strings, a paradigm which is (still) followed by Jython (not sure
about IronPython).

The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
used simply as an identity encoding that also enforces that all
bytes in the string have a value from 0x00 to 0xff, so that they are
suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
a byte-oriented protocol. The problem is that python-the-language
didn't have support for bytes at the time WSGI was designed.

[James]
 You'd have to modify your app
 to call write(unicodetext.encode('utf-8').decode('latin-1')) or so

Did you mean: write(unicodetext.encode('utf-8').encode('latin-1'))?

Either way, the second encode is not required;
write(unicodetext.encode('utf-8')) is sufficient, since it will
generate a byte-sequence(string) which will (actually should: see
(*) note below) pass the following test.

try:
    wsgi_response_data.encode('iso-8859-1')
except UnicodeError:
    # Illegal WSGI response data!
    raise

On a side note, it's worth noting that Philip Jenvey's excellent
rework of the jython IO subsystem to use java.nio is fundamentally
byte oriented.

http://www.nabble.com/fileno-support-is-not-in-jython.-Reason--t4750734.html
http://fisheye3.cenqua.com/browse/jython/trunk/jython/src/org/python/core/io

That is because it is based on the new IO design for Python 3K, as described in PEP 3116:

http://www.python.org/dev/peps/pep-3116/

Regards,

Alan.

[*] Although I notice that CPython 2.5, for a reason I don't fully
understand, fails this particular encoding sequence. (Maybe it's to do
with the possibility that the result of an encode operation is no
longer an encodable string?)

Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8').encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position
22: ordinal not in range(128)


Meaning that to enforce the WSGI iso-8859-1 convention on CPython 2.5,
you would have to carry out this rigmarole:

>>> response.encode('utf-8').decode('latin-1').encode('latin-1')
'interferon-gamma (IFN-\xce\xb3) responses in cattle'


Perhaps this behaviour is an artifact of the CPython implementation?

Whereas Jython passes it just fine (and correctly, IMHO):

Jython 2.2.1 on java1.4.2_15
Type "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
>>> response.encode('utf-8').encode('latin-1')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
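
One likely explanation for the CPython 2.5 traceback, offered as a sketch rather than a definitive account: calling .encode() on a Python 2 byte string first decodes it implicitly with the default codec (ASCII) and only then applies the requested encoding, which is why the error names the 'ascii' codec. The Python 3 snippet below spells out that hidden step; the function name is illustrative.

```python
def py2_style_encode(byte_string, encoding):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode."""
    return byte_string.decode("ascii").encode(encoding)


response = "interferon-gamma (IFN-\u03b3) responses in cattle"
utf8_bytes = response.encode("utf-8")

try:
    py2_style_encode(utf8_bytes, "latin-1")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the first non-ASCII byte

print(error_position)
```

The explicit .decode('latin-1') in the rigmarole above works because it replaces the implicit ASCII step with a codec that accepts every byte value.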



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Ian Bicking
Phillip J. Eby wrote:
 So here are my recommendations so far for the addendum to WSGI *1.0* for 
 Python 3.0 (I expect we can be more strict for WSGI 2.0):
 
 * When running under Python 3, applications SHOULD produce bytes output 
 and headers
 
 * When running under Python 3, servers and gateways MUST accept strings 
 as application output or headers, under the existing rules (i.e., 
 s.encode('latin-1') must convert the string to bytes without an exception)
 
 * When running under Python 3, servers MUST provide CGI HTTP variables 
 as strings, decoded from the headers using HTTP standard encodings (i.e. 
 latin-1 + RFC 2047)  (Open question: are there any CGI or WSGI variables 
 that should NOT be strings?)

I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1. 
  That is, after you urldecode the values (as WSGI asks you to do) 
proper conversion to text is to decode it as UTF8.

I'm a bit confused on how HTTP_COOKIE gets encoded.  And QUERY_STRING 
also confuses me.

Is this all compatible with os.environ in py3k?  I don't care that much 
if it does, but as the starting point for CGI it would be interesting if 
it stays in sync.

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Phillip J. Eby
So here are my recommendations so far for the addendum to WSGI *1.0* 
for Python 3.0 (I expect we can be more strict for WSGI 2.0):

* When running under Python 3, applications SHOULD produce bytes 
output and headers

* When running under Python 3, servers and gateways MUST accept 
strings as application output or headers, under the existing rules 
(i.e., s.encode('latin-1') must convert the string to bytes without 
an exception)

* When running under Python 3, servers MUST provide CGI HTTP 
variables as strings, decoded from the headers using HTTP standard 
encodings (i.e. latin-1 + RFC 2047)  (Open question: are there any 
CGI or WSGI variables that should NOT be strings?)

* When running under Python 3, servers MUST make wsgi.input a binary 
(byte) stream

* When running under Python 3, servers MUST provide a text stream for 
wsgi.errors

These rules are intended to simplify the porting of existing 
code.  Notice, for example, that these rules allow middleware to pass 
strings through unchanged, since they are not required to produce 
bytes output or headers.
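
To make the second rule concrete, here is a minimal sketch of how a server or gateway might normalise what an application hands it, assuming Python 3. The function names are hypothetical and this follows only the proposal in this message, not any later specification.

```python
def to_wire_bytes(value):
    """Normalise application output or a header value to bytes.

    bytes pass through unchanged; str is accepted only if it can be
    encoded as latin-1, mirroring the existing WSGI rule.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # Raises UnicodeEncodeError for code points above U+00FF.
        return value.encode("latin-1")
    raise TypeError("expected bytes or str, got %r" % type(value))


def normalise_headers(headers):
    """Apply to_wire_bytes to every (name, value) pair."""
    return [(to_wire_bytes(name), to_wire_bytes(value))
            for name, value in headers]
```

Middleware that does not inspect the data can skip this step entirely and pass strings or bytes through untouched, which is the porting convenience described above.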

Unfortunately, wsgi.input can't be coded around, but for most 
frameworks this should be a single point of pain.  In fact, if the 
'cgi' stdlib module is made compatible with bytes, only the rare 
framework that rolls its own multipart parser or otherwise directly 
manipulates put/post data will be affected.  Code that just takes the 
input and writes it to a file won't be bothered, either.
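
A small sketch of what the stream rules mean for application code, assuming Python 3. The toy application and the call_app driver are hypothetical; a real server would supply a much fuller environ. io.BytesIO stands in for a binary wsgi.input and io.StringIO for a text wsgi.errors.

```python
import io


def echo_app(environ, start_response):
    """Toy application: echo the request body back unchanged."""
    body = environ["wsgi.input"].read()            # bytes under the proposal
    print("received %d bytes" % len(body),
          file=environ["wsgi.errors"])             # text stream
    start_response("200 OK",
                   [(b"Content-Type", b"application/octet-stream")])
    return [body]


def call_app(app, payload):
    """Drive the app once with an in-memory environ (illustration only)."""
    environ = {
        "REQUEST_METHOD": "PUT",
        "wsgi.input": io.BytesIO(payload),
        "wsgi.errors": io.StringIO(),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = list(app(environ, start_response))
    return captured["status"], chunks, environ["wsgi.errors"].getvalue()
```

Code like echo_app, which only moves the body around, never needs to know an encoding; only code that parses the body has to deal with bytes explicitly.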

Comments or questions?



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread James Y Knight

On Dec 7, 2007, at 5:46 PM, Andrew Clover wrote:
 OTOH making the dictionaries reflect the underlying OS's conception of
 environment variables means users of os.environ and WSGI will have  
 to be
 able to cope with both bytes and unicode, which would also be a big
 annoyance.

 In summary: urgh, this is all messy and 'orrible.

I suppose this is more a question for python-dev, but, it'd be really  
nice if Python on Windows made it look like the windows system  
encoding was always UTF-8. That is, bytestrings used for open/ 
os.environ/argv/etc. are always encoded/decoded in utf-8, not the  
broken-platform-encoding. Then the same code would work just as well  
on unix as it does on windows.

Actually, I bet I could implement that today, just by wrapping some  
stuff... hmmm...

James


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Andrew Clover
James Y Knight wrote:

 In addition, I know of nobody who actually implements RFC 2047
 decoding of http header values...nothing really uses it. (of
 course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI. 
Most browsers, when quoting a header parameter, simply encode using the 
previous page's charset and put quotes around it... even if the 
parameter has a quote or control codes in it.

Ian wrote:

  Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This 
is correct for Windows where environment variables are explicitly 
Unicode, but questionable (IMO) for Unix where they're really bytes that 
may or may not represent decodeable Unicode strings.

 SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these 
are passed in environment variables, IIS* has to decode the submitted 
bytes to Unicode first. It seems always to choose UTF-8 for this job, 
which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, os.environ being byte strings, Python/the C library then 
has to encode them back to bytes, which I believe ends up using the 
system codepage. Since the system codepage is never UTF-8 on Windows 
this means not only that the bytes read back from e.g. PATH_INFO are not 
the same as the original bytes submitted to the web server, but that if 
there are characters outside the system codepage submitted, they'll be 
unrecoverable.
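
The loss described above can be sketched in a few lines of Python 3. The codepage cp1252 is assumed here as a typical Western Windows system codepage, and errors="replace" is an assumption standing in for whatever the C library actually does with unrepresentable characters.

```python
def through_codepage(original_bytes, codepage="cp1252"):
    """Decode submitted UTF-8 bytes, then re-encode in a system codepage."""
    text = original_bytes.decode("utf-8")        # what the web server does
    return text.encode(codepage, "replace")      # what a bytes environ gets


submitted = b"/IFN-\xce\xb3"                     # UTF-8 for U+03B3
recovered = through_codepage(submitted)

in_codepage = through_codepage(b"/caf\xc3\xa9")  # U+00E9 exists in cp1252
```

In the second case the character survives but the bytes differ from what was submitted; in the first the original character cannot be recovered at all.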

If os.environ remains Unicode in Unix and WSGI follows it (as it must if 
CGI-invoked WSGI is to continue working smoothly), webapps that try to 
allow for non-ASCII characters in URLs are likely to get some nasty 
deployment problems that depend on the system encoding setting, 
something that will be particularly troublesome for end-users to debug 
and fix.

OTOH making the dictionaries reflect the underlying OS's conception of 
environment variables means users of os.environ and WSGI will have to be 
able to cope with both bytes and unicode, which would also be a big 
annoyance.

In summary: urgh, this is all messy and 'orrible.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Andrew Clover
Adam Atlas [EMAIL PROTECTED] wrote:

 I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and 
unicode characters should be done at a higher level, by whatever web 
framework is living above WSGI.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread James Y Knight

On Dec 7, 2007, at 2:55 PM, Phillip J. Eby wrote:

 * When running under Python 3, servers MUST provide CGI HTTP
 variables as strings, decoded from the headers using HTTP standard
 encodings (i.e. latin-1 + RFC 2047)  (Open question: are there any
 CGI or WSGI variables that should NOT be strings?)

A WSGI gateway should *not* decode headers using RFC 2047. It actually  
*cannot*, without knowing the structure of that particular header,  
because only TEXT tokens are encoded that way. In addition, I know of  
nobody who actually implements RFC 2047 decoding of http header  
values...nothing really uses it. (of course I don't know of all  
implementations out there.)


On Dec 7, 2007, at 3:24 PM, Ian Bicking wrote:

 I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not  
 latin1.
  That is, after you urldecode the values (as WSGI asks you to do)
 proper conversion to text is to decode it as UTF8.

Surely not! URLs aren't always utf-8 encoded, only often.
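
A sketch of why the encoding of a path cannot simply be assumed, using Python 3's urllib.parse.unquote_to_bytes. The fallback to latin-1 is a guess chosen for illustration (it never fails, but may produce the wrong characters); it is not a policy anyone in this thread agreed on.

```python
from urllib.parse import unquote_to_bytes


def decode_path(quoted_path):
    """URL-decode to bytes, then guess: UTF-8 first, latin-1 as a fallback."""
    raw = unquote_to_bytes(quoted_path)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # maps every byte, so cannot fail


utf8_path = decode_path("/caf%C3%A9")    # percent-encoded UTF-8
latin1_path = decode_path("/caf%E9")     # percent-encoded latin-1
```

Both requests end up as the same text here, but only because the guess happened to be right; the raw bytes are the only thing the server knows for certain.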

James




Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Phillip J. Eby
At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
Has anyone had any thoughts about how WSGI is going to be made to work
with Python 3?

From what I understand about changes in Python 3, the main issue seems
to be the removal of the string type in its current form.

This is an issue as the WSGI specification currently states that status,
header names/values and the items returned by the iterable must all be
string instances. This is done to ensure that the application has done
any conversions from Unicode, where knowledge about encoding would be
known, before being passed to the WSGI adapter.

In Python 3 the default for string type objects will effectively be
Unicode. Is WSGI going to be made to somehow cope with that, or will
applications instead be required to return byte string objects?

WSGI already copes, actually.  Note that Jython and IronPython have 
this issue today, and see:

http://www.python.org/dev/peps/pep-0333/#unicode-issues

On Python platforms where the str or StringType type is in fact 
Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all 
strings referred to in this specification must contain only code 
points representable in ISO-8859-1 encoding (\u0000 through \u00FF, 
inclusive). It is a fatal error for an application to supply strings 
containing any other Unicode character or code point. Similarly, 
servers and gateways must not supply strings to an application 
containing any other Unicode characters.




Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Guido van Rossum
On Dec 6, 2007 4:15 PM, Phillip J. Eby [EMAIL PROTECTED] wrote:
 At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
 Has anyone had any thoughts about how WSGI is going to be made to work
 with Python 3?
 
 From what I understand about changes in Python 3, the main issue seems
 to be the removal of the string type in its current form.
 
 This is an issue as the WSGI specification currently states that status,
 header names/values and the items returned by the iterable must all be
 string instances. This is done to ensure that the application has done
 any conversions from Unicode, where knowledge about encoding would be
 known, before being passed to the WSGI adapter.
 
 In Python 3 the default for string type objects will effectively be
 Unicode. Is WSGI going to be made to somehow cope with that, or will
 applications instead be required to return byte string objects?

 WSGI already copes, actually.  Note that Jython and IronPython have
 this issue today, and see:

 http://www.python.org/dev/peps/pep-0333/#unicode-issues

 On Python platforms where the str or StringType type is in fact
 Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
 strings referred to in this specification must contain only code
 points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
 inclusive). It is a fatal error for an application to supply strings
 containing any other Unicode character or code point. Similarly,
 servers and gateways must not supply strings to an application
 containing any other Unicode characters.

That may work for IronPython/Jython, where encoded data is represented
by the str type, but it won't be sufficient for Py3k, where encoded
data is represented using the bytes type. IOW, in IronPython/Jython,
u"\u1234".encode('utf-8') returns a str instance: '\xe1\x88\xb4'; but
in Py3k, it returns a bytes instance: b'\xe1\x88\xb4'.

The issue applies to input as well as output -- data read from a
socket is also represented as bytes, unless you're using makefile()
with a text mode and an encoding.
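
Both halves of this point can be sketched with the stdlib, assuming a current Python 3. socket.socketpair() is used only to get a connected pair without a network; the helper name and the choice of latin-1 are illustrative, and the single recv() call relies on the payload being small.

```python
import socket


def read_raw_and_text(payload, encoding="latin-1"):
    """Send payload twice over a local socket pair.

    The first copy is read with recv() (bytes); the second is read
    through makefile() in text mode with an explicit encoding (str).
    """
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)
        raw = receiver.recv(4096)            # plain socket reads yield bytes
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)      # signal EOF so read() returns
        with receiver.makefile("r", encoding=encoding) as reader:
            text = reader.read()             # decoded to str
    finally:
        sender.close()
        receiver.close()
    return raw, text


encoded = "\u1234".encode("utf-8")           # a bytes instance in Python 3
raw, text = read_raw_and_text(b"caf\xe9\n")
```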

You might want to look at how the unittests for wsgiref manage to pass
in Py3k though. ;-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread James Y Knight

On Dec 6, 2007, at 7:15 PM, Phillip J. Eby wrote:
 WSGI already copes, actually.  Note that Jython and IronPython have
 this issue today, and see:

 http://www.python.org/dev/peps/pep-0333/#unicode-issues

 On Python platforms where the str or StringType type is in fact
 Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
 strings referred to in this specification must contain only code
 points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
 inclusive). It is a fatal error for an application to supply strings
 containing any other Unicode character or code point. Similarly,
 servers and gateways must not supply strings to an application
 containing any other Unicode characters.

It would seem very odd, however, for WSGI/python3 to use strings- 
restricted-to-0xFF for network I/O while everywhere else in python3 is  
going to use bytes for the same purpose. You'd have to modify your app  
to call write(unicodetext.encode('utf-8').decode('latin-1')) or so
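
For illustration, the transcoding step mentioned above looks like this in Python 3. It is a sketch of the workaround being criticised, not a recommendation, and the helper names are made up.

```python
def to_latin1_str(text, charset="utf-8"):
    """Encode text, then disguise the bytes as a latin-1 str."""
    return text.encode(charset).decode("latin-1")


def to_wire(latin1_str):
    """What a server would do to recover the original bytes."""
    return latin1_str.encode("latin-1")


disguised = to_latin1_str("IFN-\u03b3")   # every code point is <= U+00FF
wire = to_wire(disguised)                 # identical to the UTF-8 bytes
```

The intermediate string is not meaningful text; it only works as a container because latin-1 maps each byte value to the code point with the same number.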

James


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Phillip J. Eby
At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:

On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
  In Python 3 the default for string type objects will effectively be
  Unicode. Is WSGI going to be made to somehow cope with that, or will
  applications instead be required to return byte string objects?

I'd say it would be best to only accept `bytes` objects; anything else
would require some guesswork. Maybe, at most, it could try to encode
returned Unicode objects as ISO-8859-1, and have it be an error if
that's not possible.

Actually, I'd prefer to look at it the other way around: a Python 3 
WSGI server or middleware *may* accept bytes objects instead of str.

This is relatively easy for the response side of things, but the 
request side is rather more difficult, since wsgi.input may need to 
be binary rather than text mode.  (I think we can reasonably assume 
that wsgi.errors is a text mode stream, and should support a 
reasonable encoding.)



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Adam Atlas

On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
 In Python 3 the default for string type objects will effectively be
 Unicode. Is WSGI going to be made to somehow cope with that, or will
 applications instead be required to return byte string objects?

I'd say it would be best to only accept `bytes` objects; anything else  
would require some guesswork. Maybe, at most, it could try to encode  
returned Unicode objects as ISO-8859-1, and have it be an error if  
that's not possible.

I was going to say that the gateway could accept Unicode objects if  
the user-agent sent a comprehensible Accept-Charset header, and  
thereby encode application output to the client's preferred character  
set on the fly (or to ISO-8859-1 if no Accept-Charset is provided),  
but that would complicate things for people writing gateways (and  
would be too implicit). It could be useful, but it would make more  
sense as a simple decorator for (almost-)WSGI applications. Perhaps it  
could go in wsgiref.
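
A rough sketch of what such a decorator could look like, assuming Python 3. Everything here is hypothetical: the names are invented, the Accept-Charset parsing is deliberately naive (it takes the first charset Python recognises and ignores q-values), and the Content-Type header is not adjusted to match.

```python
import codecs


def pick_charset(accept_charset, default="iso-8859-1"):
    """Return the first charset in the header that Python can encode to."""
    for item in accept_charset.split(","):
        name = item.split(";", 1)[0].strip()
        if not name or name == "*":
            continue
        try:
            codecs.lookup(name)
        except LookupError:
            continue
        return name
    return default


def encode_output(app):
    """Wrap an (almost-)WSGI app that yields str so that it yields bytes."""
    def wrapper(environ, start_response):
        charset = pick_charset(environ.get("HTTP_ACCEPT_CHARSET", ""))
        for chunk in app(environ, start_response):
            # bytes are passed through; str is encoded for this client
            yield chunk.encode(charset) if isinstance(chunk, str) else chunk
    return wrapper
```

Keeping this in a decorator rather than in the gateway leaves the gateway itself bytes-only, which is the explicit behaviour argued for above.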


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread James Bennett
On Dec 6, 2007 6:15 PM, Phillip J. Eby [EMAIL PROTECTED] wrote:
 WSGI already copes, actually.  Note that Jython and IronPython have
 this issue today, and see:

 http://www.python.org/dev/peps/pep-0333/#unicode-issues

I'm glad you brought that up, because it's been bugging me lately.

That section is somewhat ambiguous as-is, because in one sentence
applications are permitted to return strings encoded in a charset
other than ISO-8859-1, but in another they are unequivocally forbidden
to do so (with the "must not" in bold, even). And that's problematic
not only because of the ambiguity, but because the increasing
popularity of AJAX and web-based APIs is making it much more common
for WSGI applications to generate responses of types which do not
default to ISO-8859-1 -- e.g., XML and JSON, both of which default to
UTF-8.

Depending on how draconian one wishes to be when reading the relevant
section of WSGI, it's possible to conclude that XML and JSON must
always be transcoded/escaped to ISO-8859-1 -- with all the headaches
that entails -- before being passed to a WSGI-compliant piece of
software.
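
The two readings can be contrasted with a short Python 3 sketch using the stdlib json module; the payload is invented for illustration. The first body is what a UTF-8 JSON response would naturally contain, while the second shows the escaping needed to stay within ISO-8859-1.

```python
import json

payload = {"name": "IFN-\u03b3"}

# Natural form: the document is UTF-8 and is handed over as bytes.
utf8_body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# Strict reading: escape everything non-ASCII so the text fits ISO-8859-1.
escaped_body = json.dumps(payload).encode("iso-8859-1")
```

Both bodies decode to the same data, but the second depends on the format having an escape syntax, which is the kind of headache described above.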

And the slightly less strict reading of the spec -- that such
gymnastics are required only when the string type of the Python
implementation is Unicode-based -- will grow increasingly troublesome
as/when Py3K enters production use.

So as long as we're talking about this, could the proscriptions with
respect to encoding perhaps be revisited and (hopefully)
clarified/revised?

-- 
Bureaucrat Conrad, you are technically correct -- the best kind of correct.


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Ian Bicking
Phillip J. Eby wrote:
 At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
 
 On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
 In Python 3 the default for string type objects will effectively be
 Unicode. Is WSGI going to be made to somehow cope with that, or will
 applications instead be required to return byte string objects?
 I'd say it would be best to only accept `bytes` objects; anything else
 would require some guesswork. Maybe, at most, it could try to encode
 returned Unicode objects as ISO-8859-1, and have it be an error if
 that's not possible.
 
 Actually, I'd prefer to look at it the other way around: a Python 3 
 WSGI server or middleware *may* accept bytes objects instead of str.
 
 This is relatively easy for the response side of things, but the 
 request side is rather more difficult, since wsgi.input may need to 
 be binary rather than text mode.  (I think we can reasonably assume 
 that wsgi.errors is a text mode stream, and should support a 
 reasonable encoding.)

wsgi.input definitely seems like it should be bytes to me.  Unless we 
want to put the encoding process into the server.  Not entirely 
infeasible, but a bit of a strain.  And the request body might very well 
be binary, e.g., on a PUT.

The CGI keys in the environment don't feel at all like bytes to me, but 
then they aren't unicode either.  They can be unicode, again given a bit 
of work on the server side.  Though unfortunately browsers are very poor 
at indicating their encoding for requests, and it ends up being policy 
and configuration as much as anything that determines the encoding of 
stuff like wsgi.input.  I believe all request paths are UTF8 (?), but 
I'm not sure about QUERY_STRING.  I'm a little fuzzy on some of the 
details there.

The actual response body should also be bytes.  Unless again we want to 
introduce upstream encoding.

This does make everything feel more complicated.

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Guido van Rossum
On Dec 6, 2007 8:00 PM, Ian Bicking [EMAIL PROTECTED] wrote:
 Phillip J. Eby wrote:
  At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
 
  On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
  In Python 3 the default for string type objects will effectively be
  Unicode. Is WSGI going to be made to somehow cope with that, or will
  applications instead be required to return byte string objects?
  I'd say it would be best to only accept `bytes` objects; anything else
  would require some guesswork. Maybe, at most, it could try to encode
  returned Unicode objects as ISO-8859-1, and have it be an error if
  that's not possible.
 
  Actually, I'd prefer to look at it the other way around: a Python 3
  WSGI server or middleware *may* accept bytes objects instead of str.
 
  This is relatively easy for the response side of things, but the
  request side is rather more difficult, since wsgi.input may need to
  be binary rather than text mode.  (I think we can reasonably assume
  that wsgi.errors is a text mode stream, and should support a
  reasonable encoding.)

 wsgi.input definitely seems like it should be bytes to me.  Unless we
 want to put the encoding process into the server.  Not entirely
 infeasible, but a bit of a strain.  And the request body might very well
 be binary, e.g., on a PUT.

 The CGI keys in the environment don't feel at all like bytes to me, but
 then they aren't unicode either.  They can be unicode, again given a bit
 of work on the server side.  Though unfortunately browsers are very poor
 at indicating their encoding for requests, and it ends up being policy
 and configuration as much as anything that determines the encoding of
 stuff like wsgi.input.  I believe all request paths are UTF8 (?), but
 I'm not sure about QUERY_STRING.  I'm a little fuzzy on some of the
 details there.

 The actual response body should also be bytes.  Unless again we want to
 introduce upstream encoding.

 This does make everything feel more complicated.

It's the same level of complexity you run into as soon as you want to
handle Unicode with WSGI in 2.x though, as it is caused by something
outside our control (HTTP and browsers).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)