Re: Charsets in HTTP (was: the CGI.pm in Perl 6)

2006-09-16 Thread A. Pagaltzis
* Juerd <[EMAIL PROTECTED]> [2006-09-16 22:15]:
> My thoughts:
> 
> * Default Content-Type header of "text/html; charset=UTF-8".
> * Default output encoding of UTF-8.
> * When a new Content-Type is set, but no new encoding
>     * Keep the default output encoding of UTF-8
>     * Warn if it's text/* without /charset=/
>     * Use the specified charset as the output encoding
>     * Change the output encoding to raw bytes if it's not text/*
> * When a new Content-Type is set, and a new encoding is given
>     * Use the supplied encoding
>     * Warn if it's text/* without /charset=/
>     * Warn if supplied encoding and charset aren't equal enough

I had to read your mail twice to get what you really meant here,
but now that I have, this sounds reasonable.

> I think it's important to realise that only text/* have
> charset, and that Content-Type is MIME-type plus charset in one
> value. We shouldn't be "clever" and separate these: they're one
> string.

Sounds good to me.

Regards,
-- 
Aristotle Pagaltzis // 


Re: Charsets in HTTP (was: the CGI.pm in Perl 6)

2006-09-16 Thread Darren Duncan

At 7:38 PM +0200 9/16/06, A. Pagaltzis wrote:

> * Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:

> > 4.  Make UTF-8 the default HTTP response character encoding,
> > and the default declared charset for text/* MIME types, and
> > explicitly declare that this is what the charset is.  The only
> > time that output should be anything else, even Latin-1, is if
> > the programmer specifies such.


> No, please don't. For unknown MIME types, the charset should be
> undeclared. In particular, `application/octet-stream` should
> never have a charset forced on it if one is not assigned by the
> client code explicitly. Likewise, for `application/xml` and
> `application/*+xml`, a charset should NEVER be explicitly
> declared, as XML documents are self-describing, whereas declaring
> a charset forces using the charset declared in the HTTP header.
> This is very unwise (cf. Ruby's Postulate).


Look again; I was only specifying that a default charset is used for 
text/* MIME types, not for non-text/* MIME types; the latter would 
typically have no charset, as you say.



> > 5.  Similarly, default to trying to treat the HTTP request as
> > UTF-8 if it doesn't specify a character encoding; fallback to
> > Latin-1 only if the text parts of the HTTP request don't look
> > like valid UTF-8.


> This is not just unwise, it is actually wrong. Latin-1 is the
> default for `text/*` MIME types if no charset is declared. Using
> a different charset in violation of the HTTP RFCs is __BROKEN__.


Okay, I retract that suggestion, because the official HTTP spec says 
that no explicit charset means Latin-1.
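
For concreteness, here is a minimal sketch of that request-side rule in 
present-day Raku.  The sub name request-charset is made up and not part 
of any proposed module; it just shows the rule: an explicitly declared 
charset wins, text/* without one defaults to ISO-8859-1, and non-text 
types are treated as raw bytes.

    # Hypothetical helper: pick the request charset per RFC 2616.
    sub request-charset(Str $content-type --> Str) {
        # An explicitly declared charset parameter wins.
        if $content-type ~~ m:i/ 'charset=' '"'? $<cs> = (<-[ " ; \s ]>+) / {
            return ~$<cs>;
        }
        # text/* without a declared charset defaults to ISO-8859-1 (Latin-1).
        return 'iso-8859-1' if $content-type ~~ m:i/ ^ 'text/' /;
        # Non-text types carry no charset; the body is raw bytes.
        return '';
    }

    say request-charset('text/plain');                 # iso-8859-1
    say request-charset('text/html; charset=UTF-8');   # UTF-8
    say request-charset('application/octet-stream');   # (empty: raw bytes)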



> In fact, now that I'm writing all this out, I am starting to
> think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
> does. Otherwise, the code and API would have to be able to deal
> with the full complexity of charsets in HTTP, and the docs would
> have to explain it, which is no picnic at all.


I disagree.  Regardless of the details, a Perl 6 replacement for 
CGI.pm *should* handle character set issues.  Its users should simply 
be able to pull out correctly interpreted, ready-to-use Str values 
when the HTTP request content type is text, without having to know 
what character set was used in the request.  Analogously, if 
the user takes their Str values and supplies them to an HTTP response 
whose content type is text, they should not have to specify an output 
encoding if they don't want to, and UTF-8 is the best default because 
it can handle all possible characters that the Str repertoire can 
represent.


The CGI.pm replacement by no means has to do the dirty work of 
processing encodings itself, such as mapping bytes to chars. 
Those details would be handled by something else, such as either Perl 
6 itself or a Perl 6 analogue of Encode.pm.
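
As a rough illustration of that split, here is a minimal sketch that 
leans on built-in Raku Blob/Str encoding support as a stand-in for an 
Encode.pm analogue; the sub names decode-body and encode-body are 
invented for illustration, not proposed API.

    # The CGI replacement only decides *which* charset applies; the
    # actual byte/char mapping is delegated to the language.
    sub decode-body(Blob $bytes, Str $charset --> Str) {
        return $bytes.decode($charset);    # request bytes -> ready-to-use Str
    }

    sub encode-body(Str $text, Str $charset = 'utf-8' --> Blob) {
        return $text.encode($charset);     # UTF-8 is the default output encoding
    }

    my Blob $raw   = "café".encode('iso-8859-1');      # pretend request bytes
    my Str  $value = decode-body($raw, 'iso-8859-1');  # decoded Str for the user
    my Blob $out   = encode-body($value);              # UTF-8 bytes for the response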


-- Darren Duncan


Re: Charsets in HTTP (was: the CGI.pm in Perl 6)

2006-09-16 Thread Juerd
A. Pagaltzis skribis 2006-09-16 19:38 (+0200):
> * Darren Duncan <[EMAIL PROTECTED]> [2006-09-09 20:40]:
> > 4.  Make UTF-8 the default HTTP response character encoding, and the
> > default declared charset for text/* MIME types, and explicitly
> > declare that this is what the charset is.  The only time that output
> > should be anything else, even Latin-1, is if the programmer
> > specifies such.
> No, please don't. For unknown MIME types, the charset should be
> undeclared. In particular, `application/octet-stream` should never
> have a charset forced on it if one is not assigned by the client code
> explicitly. Likewise, for `application/xml` and `application/*+xml`, a
> charset should NEVER be explicitly declared, as XML documents are
> self-describing, whereas declaring a charset forces using the charset
> declared in the HTTP header.  This is very unwise (cf. Ruby's
> Postulate).

Darren discussed the *default* encoding. Just as text/html is a nice
default for the MIME type, UTF-8 is a nice default encoding. Both should
be overridable.

My thoughts:

* Default Content-Type header of "text/html; charset=UTF-8".
* Default output encoding of UTF-8.
* When a new Content-Type is set, but no new encoding
    * Keep the default output encoding of UTF-8
    * Warn if it's text/* without /charset=/
    * Use the specified charset as the output encoding
    * Change the output encoding to raw bytes if it's not text/*
* When a new Content-Type is set, and a new encoding is given
    * Use the supplied encoding
    * Warn if it's text/* without /charset=/
    * Warn if supplied encoding and charset aren't equal enough

I think it's important to realise that only text/* have charset, and
that Content-Type is MIME-type plus charset in one value. We shouldn't
be "clever" and separate these: they're one string.

For XML, you'd have to explicitly mention Content-Type and encoding,
because the encoding can no longer be taken from the Content-Type, and
the default for non-text/* is raw bytes.
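
As a rough sketch of the scheme above (hypothetical code, not part of 
any proposed module; the sub name and warning texts are invented, and 
"equal enough" is approximated by a case-insensitive comparison):

    sub output-encoding(Str $content-type, Str $explicit-encoding? --> Str) {
        my $is-text = so $content-type ~~ m:i/ ^ 'text/' /;
        my $charset = '';
        if $content-type ~~ m:i/ 'charset=' '"'? $<cs> = (<-[ " ; \s ]>+) / {
            $charset = ~$<cs>;
        }

        warn "text/* Content-Type without a charset" if $is-text && !$charset;

        with $explicit-encoding {
            # a supplied encoding always wins over the Content-Type header
            warn "encoding '$_' differs from declared charset '$charset'"
                if $charset && lc($_) ne lc($charset);
            return $_;
        }

        return $charset if $charset;       # take the encoding from the header
        return $is-text ?? 'utf-8' !! '';  # UTF-8 for text/*, raw bytes otherwise
    }

    say output-encoding('text/html; charset=iso-8859-1');  # iso-8859-1
    say output-encoding('application/xml', 'UTF-8');        # UTF-8
    say output-encoding('image/png');                       # (empty: raw bytes)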

> > 5.  Similarly, default to trying to treat the HTTP request as
> > UTF-8 if it doesn't specify a character encoding; fallback to
> > Latin-1 only if the text parts of the HTTP request don't look
> > like valid UTF-8.
> This is not just unwise, it is actually wrong. Latin-1 is the
> default for `text/*` MIME types if no charset is declared. Using
> a different charset in violation of the HTTP RFCs is __BROKEN__.

Agreed.

> In fact, now that I'm writing all this out, I am starting to
> think that maybe CGI.pm6 should simply punt on charsets as CGI.pm
> does. Otherwise, the code and API would have to be able to deal
> with the full complexity of charsets in HTTP, and the docs would
> have to explain it, which is no picnic at all.

Simple schemes can always be documented equally simply.

A first attempt:

The default value for the C<Content-Type> header is
C<text/html; charset=UTF-8>.

The encoding that $module uses for output data is taken from the
C<charset> attribute in the C<Content-Type> header. If there is no
charset in the C<Content-Type> header, UTF-8 is used for all text/*
types, and raw for everything else.

It is possible to explicitly force an output encoding. When you're
not sending a text/* document, you need to do this if the document
does contain text. This is the case with most XML formats.

$response1.type = 'text/html; charset=iso-8859-1';
# implies: $response1.encoding = 'iso-8859-1';

$response2.type = 'application/xml';
$response2.encoding = 'UTF-8';

my $response3 = Web::Response.new :type('text/html; charset=iso-8859-1');
my $response4 = Web::Response.new :type<application/xml>, :encoding<UTF-8>;
-- 
korajn salutojn,

  juerd waalboer:  perl hacker  <[EMAIL PROTECTED]>  
  convolution: ict solutions and consultancy <[EMAIL PROTECTED]>