Hi all,

I noticed that the HTTP server that serves tomcat.apache.org (and other Apache sites) automatically includes a "charset=UTF-8" parameter in the Content-Type header for static *.html and *.txt files, regardless of the actual encoding of the file. For example, if you request http://tomcat.apache.org/index.html (a static HTML page), the Content-Type header will be:

Content-Type: text/html; charset=utf-8

Now, although I'm a fan of using UTF-8 for everything (especially for Web pages), and including a "charset" parameter in the Content-Type probably saves the browser some time because it doesn't need to work out the encoding from the file content, this means that some .html pages have conflicting encoding declarations, as not all .html pages on the Tomcat website are encoded as UTF-8.

For example, for this page:
http://tomcat.apache.org/tomcat-6.0-doc/index.html
the encoding in the Content-Type header says "UTF-8", but the encoding declared in the file content says "ISO-8859-1", which is the actual encoding of the file. As the encoding from the HTTP Content-Type header takes precedence, browsers will interpret the file as UTF-8 instead of ISO-8859-1. This means that if the file contains non-ASCII characters (> 0x7F), a browser will display them incorrectly because it uses the wrong encoding.

This mostly affects the docs of older Tomcat versions (3, 4, 5, 6, 7), as they are in ISO-8859-1, whereas Tomcat 8's docs are in UTF-8. (Though as far as I have seen, none of these .html pages use non-ASCII characters directly; they encode them as entity references or character references, so for them this issue has no practical consequences.) While the XSLT for Tomcat 6 and 7 can probably be changed to output UTF-8, I don't know whether something like this should be done for the docs of unsupported versions like 3.x.

This is an example of a site where the conflicting encodings cause problems:
http://commons.apache.org/proper/commons-dbcp/
In the left-hand menu, there is an <h5> element with the text "Commons DBCP", but the space is actually a 0xA0 byte (nbsp in ISO-8859-1). As a lone 0xA0 byte is not valid UTF-8, browsers fail to decode it when using UTF-8, so they display "�" (U+FFFD, Replacement Character) instead. If you manually change the encoding to ISO-8859-1 in the browser's menu, the page is displayed correctly.

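
To illustrate the decoding problem described above, here is a minimal Python sketch. It is only an approximation of what a browser does: the helper names (header_charset, decode_like_browser) are made up for this example, and the byte string is my reconstruction of the "Commons DBCP" heading, not something copied from the server. The rule it models is simply "use the charset from the Content-Type header if there is one, otherwise fall back to the encoding declared in the file".

```python
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_like_browser(raw, content_type, declared_in_file):
    """Simplified model: the HTTP header wins over the in-file declaration."""
    charset = header_charset(content_type) or declared_in_file
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded
    return raw.decode(charset, errors="replace")


# Assumed bytes of the menu heading in an ISO-8859-1 file:
# "Commons", a no-break space (0xA0), then "DBCP".
raw = b"Commons\xa0DBCP"

# Header says UTF-8: the lone 0xA0 byte is invalid and becomes U+FFFD.
with_header = decode_like_browser(raw, "text/html; charset=utf-8", "iso-8859-1")

# No charset in the header: the file's own declaration is used instead.
without_header = decode_like_browser(raw, "text/html", "iso-8859-1")

print(repr(with_header))
print(repr(without_header))
```

With the charset parameter present, the no-break space turns into the replacement character; without it, the ISO-8859-1 declaration in the file is honoured and the heading decodes as intended.
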

It seems that this issue has existed for some time, as in r1182745 Konstantin Kolinko changed the output encoding of the Tomcat site's XSLT to UTF-8, with the commit message:
"Change output encoding, so that <META> header added by XSTL processor matches with HTTP Content-Type header added by tomcat.apache.org site."

Does anybody know the reasoning behind adding a "charset=UTF-8" parameter to the Content-Type for every *.html page? Should an issue be raised for this with Apache Infra?

Thanks!

Regards,
Konstantin Preißer