Hi all,

I noticed that the HTTP server that serves tomcat.apache.org (and other 
Apache sites) automatically adds a "charset=UTF-8" parameter to the 
Content-Type header for static *.html and *.txt files, regardless of the 
actual encoding of the file.
E.g., if you request http://tomcat.apache.org/index.html (static html page), 
then the Content-Type header will be:

Content-Type: text/html; charset=utf-8
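
For anyone who wants to check this from a script, here is a minimal Python
sketch that pulls the charset parameter out of a Content-Type value. The
function name charset_from_content_type is my own (hypothetical); it only
parses a header string and does not contact the server.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value.

    Returns None when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
```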

Now, I'm a fan of using UTF-8 for everything (especially for Web pages), and 
including a "charset" parameter in the Content-Type probably saves the 
browser some time, as it doesn't need to work out the encoding from the file 
content. However, not all .html pages on the Tomcat website are encoded as 
UTF-8, so some of them end up with conflicting encoding declarations.

E.g., for this page:
http://tomcat.apache.org/tomcat-6.0-doc/index.html
the encoding in the Content-Type header says "UTF-8", but the encoding 
declared in the file content says "ISO-8859-1", which is the actual encoding 
of the file.
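
A rough way to detect such a mismatch is sketched below in Python. It is only
an illustration: the regular expression is a simplification of what browsers
really do when scanning for a <meta> declaration, the names declared_charset
and charsets_conflict are hypothetical, and the sample markup is made up
rather than copied from the page above.

```python
import codecs
import re

# Simplified pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(html_bytes):
    """Return the charset named in a <meta> tag near the start, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


def charsets_conflict(header_charset, html_bytes):
    """True if both declarations exist and name different encodings."""
    meta = declared_charset(html_bytes)
    if header_charset is None or meta is None:
        return False
    # codecs.lookup() normalises aliases such as "UTF-8" and "utf8";
    # it raises LookupError for names Python does not know.
    return codecs.lookup(header_charset).name != codecs.lookup(meta).name


# Hypothetical sample markup for illustration only.
sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=iso-8859-1"></head></html>')
print(charsets_conflict("utf-8", sample))
```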

As the encoding from the HTTP Content-Type header takes precedence, browsers 
will interpret the file as UTF-8 instead of ISO-8859-1. This means that if 
the file contains non-ASCII bytes (> 0x7F), a browser will display them 
incorrectly because it uses the wrong encoding. This mostly affects the docs 
of older Tomcat versions (3, 4, 5, 6, 7), as they are in ISO-8859-1, whereas 
Tomcat 8's docs are in UTF-8. (Though as far as I have seen, none of these 
.html pages use non-ASCII characters directly; they encode them as entity 
references or character references, so for them this issue has no practical 
consequences.)
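
To illustrate why ASCII-only pages are unaffected, here is a small Python
sketch. The helper names and the sample bytes are hypothetical and not taken
from the Tomcat docs. A file that contains only bytes below 0x80 decodes to
the same text under both encodings, and the character references are resolved
afterwards by the HTML parser.

```python
import html


def is_ascii_only(data):
    """True if every byte is below 0x80."""
    return all(byte < 0x80 for byte in data)


def same_text_in_both(data):
    """True if UTF-8 and ISO-8859-1 decoding give identical text."""
    return data.decode("utf-8", errors="replace") == data.decode("iso-8859-1")


# Hypothetical content that uses character/entity references only.
page = b"Tomcat&#160;6 &copy; ASF"
print(is_ascii_only(page), same_text_in_both(page))
# The references are expanded after decoding, independent of the charset.
print(html.unescape(page.decode("utf-8")))
```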

While for Tomcat 6 and 7 the XSLT probably can be changed to output as UTF-8, I 
don't know if something like this should be done for docs of unsupported 
versions like 3.x etc.

This is an example of a site where the conflicting encodings cause problems:
http://commons.apache.org/proper/commons-dbcp/

In the left-hand menu, there is an <h5> element with the text "Commons DBCP", 
but the space is actually a 0xA0 byte (nbsp in ISO-8859-1). On its own this 
byte is not a valid UTF-8 sequence, so browsers decoding the page as UTF-8 
display "�" (U+FFFD, Replacement Character) instead. If you manually change 
the encoding to ISO-8859-1 in the browser's menu, the page is displayed 
correctly.
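
The effect can be reproduced with a few lines of Python. This is a sketch
that assumes errors="replace" is a fair stand-in for how a browser handles
invalid UTF-8; the function name decode_both is hypothetical.

```python
def decode_both(raw):
    """Decode raw bytes as UTF-8 (with replacement) and as ISO-8859-1."""
    as_utf8 = raw.decode("utf-8", errors="replace")
    as_latin1 = raw.decode("iso-8859-1")
    return as_utf8, as_latin1


# The menu text with a 0xA0 byte in place of the space.
as_utf8, as_latin1 = decode_both(b"Commons\xa0DBCP")
print(ascii(as_utf8))    # shows the U+FFFD replacement character
print(ascii(as_latin1))  # shows the no-break space
```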

It seems that this issue has existed for some time now: in r1182745, 
Konstantin Kolinko changed the output encoding of the Tomcat site's XSLT to 
UTF-8, with the commit message:
"Change output encoding, so that <META> header added by XSTL processor matches 
with
HTTP Content-Type header added by tomcat.apache.org site."


Does anybody know the reasoning behind adding a "charset=UTF-8" parameter to 
the Content-Type for every *.html page? Should an issue be raised for this 
with Apache Infra?


Thanks!

Regards,
Konstantin Preißer

