On 9/7/10 9:16 AM, Philip Jägenstedt wrote:
UTF-8, Big5 and GBK are all (as far as I know) ASCII supersets. Do
real-world text documents include \0 bytes?

Yes.  Real-world text documents include all sorts of gunk.  Just rarely.

As long as "indicates an encoding" doesn't include UTF-8 or ISO-8859-1
(thanks, Apache!), that should be reasonable, I think.

Are you saying that Apache has, at various times, set the default
character encoding to UTF-8 or ISO-8859-1?

Yes, precisely. Though I think the UTF-8 part came from Linux distros, not Apache itself: Apache just sends whatever value is passed to AddDefaultCharset, and the distros changed that value from ISO-8859-1 to UTF-8 in their packages. Here's the relevant comment from the Gecko source where we do our text-or-binary sniffing for toplevel contexts:

 Make sure to do a case-sensitive exact match comparison here.  Apache
 1.x just sends text/plain for "unknown", while Apache 2.x sends
 text/plain with a ISO-8859-1 charset.  Debian's Apache version, just to
 be different, sends text/plain with iso-8859-1 charset.  For extra fun,
 FC7, RHEL4, and Ubuntu Feisty send charset=UTF-8.  Don't do general
 case-insensitive comparison, since we really want to apply this crap as
 rarely as we can.
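
To make the exact-match idea concrete, here is a rough Python sketch of the check that comment describes. It is not the Gecko code: the function name and the exact header strings (including the spacing after the semicolon) are assumptions reconstructed from the comment's wording.

```python
# Hypothetical sketch, not Gecko's implementation. The header values below
# are assumed from the comment: bare text/plain, the two ISO-8859-1
# spellings, and the UTF-8 default shipped by some distros.
SERVER_DEFAULT_CONTENT_TYPES = frozenset({
    "text/plain",
    "text/plain; charset=ISO-8859-1",
    "text/plain; charset=iso-8859-1",
    "text/plain; charset=UTF-8",
})


def should_sniff_text_or_binary(content_type_header: str) -> bool:
    """Return True only when the raw header is byte-for-byte one of the
    known server defaults, so sniffing is applied as rarely as possible."""
    # Deliberately no lowercasing or whitespace normalization: a header
    # written any other way is assumed to have been set on purpose.
    return content_type_header in SERVER_DEFAULT_CONTENT_TYPES
```

With this sketch, "text/plain; charset=utf-8" or "TEXT/PLAIN" would not trigger sniffing, since neither is an exact match for a listed default.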

I was hoping that no encoding parameter at all would be sent :/

Heh. I've long since given up all hope of reason on this stuff; I just try to keep it as sane and predictable and simple as possible. :(

-Boris
