christian b wrote:
> Thank you, Ard
>
> I already did that. But it doesn't change anything. I found this in my
> web.xml in the WEB-INF folder of my cocoon-build:
>
> <!--
>   Set encoding used by the container. If not set the ISO-8859-1 encoding
>   will be assumed.
>   Since the servlet specification requires that the ISO-8859-1 encoding
>   is used (by default), you should never change this value unless
>   you have a buggy servlet container.
> -->
> <init-param>
>   <param-name>container-encoding</param-name>
>   <param-value>ISO-8859-1</param-value>
> </init-param>
>
> Servlet-Container used is Tomcat 5.0.28
>
> I switched the encoding parameter to UTF-8 to check whether it would
> work, and it seems to. But still the coplets aren't encoded properly.
>

Never change your container-encoding unless your servlet container lets you specify the encoding it applies when decoding URLs and request parameters. (If you don't understand what I just said: that translates to simply "never".)

E.g. when you use Jetty (the only one I know of) you can specify a system property:

  -Dorg.mortbay.util.URI.charset=utf-8

Only then should the Cocoon servlet init param be changed to match that.

> Then I saw that the whole page is encoded in ISO-8859-1, having been
> serialized in HTML (as seen in the doctype of the page). So I looked
> for the HTML-Serializer in my portal/sitemap.xmap and changed the
> encoding of the html-serializer, too. no difference
>
> these are the feed-addresses that I want to incorporate. both don't
> have an encoding set (do RSS-feeds have to have that?) but they
> clearly contain UTF-8 encoded characters.
>

Like where? I just did a rough scan but couldn't find any 'multiple bytes for a single character' occurrences.

Note that many 'at first glance odd' characters DO have a valid position in the ISO-8859-1 encoding. E.g. U+00DF, the typical German LATIN SMALL LETTER SHARP S (Eszett), is just encoded as the single byte hex DF in Latin-1.

The fact that a certain character requires 2 bytes in UTF-8 does not mean that this character _IS_ a UTF-8 encoded char; the same character might very well have a valid and useful single-byte Latin-1 encoding.

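To make the Eszett point concrete, here is a minimal Python sketch (Python is only my choice for illustration; nothing in this thread uses it). It encodes U+00DF in both encodings and compares the bytes. The euro sign U+20AC is my own extra example of a character that ISO-8859-1 cannot represent at all.

```python
# Illustrative sketch: the same character, two different byte representations.
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S (Eszett)

latin1_bytes = sharp_s.encode("iso-8859-1")  # one byte
utf8_bytes = sharp_s.encode("utf-8")         # two bytes

print(latin1_bytes.hex())  # df
print(utf8_bytes.hex())    # c39f

# Assumption for illustration: the euro sign (U+20AC) as an example of a
# character that has no position in ISO-8859-1.
try:
    "\u20ac".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(euro_in_latin1)  # False
```

So seeing a ß in a feed says nothing about which encoding the bytes are in; only the bytes themselves (DF versus C3 9F) do.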
(In other words: the 'encoding' is never a property of the glyph, but I admit: yeah, some glyphs don't have representations in all encodings.)

> http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/

The HTTP header states that the file is iso-8859-1 encoded (see the Content-Type header):

> $ wget -S --spider http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
> --15:33:26--  http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
>            => `index.html'
> Resolving www.industrial-technology-and-witchcraft.de... 212.227.64.59
> Connecting to www.industrial-technology-and-witchcraft.de|212.227.64.59|:80... connected.
> HTTP request sent, awaiting response...
>   HTTP/1.0 200 OK
>   Date: Mon, 16 Jan 2006 14:33:26 GMT
>   Server: Apache/1.3.33 (Unix)
>   Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
>   Expires: Mon, 16 Jan 2006 13:40:42 GMT
>   Pragma: no-cache
>   X-Powered-By: PHP/4.4.1
>   Set-Cookie: exp_last_visit=822058407; expires=Tue, 16 Jan 2007 14:33:27 GMT; path=/
>   Set-Cookie: exp_last_activity=1137418407; expires=Tue, 16 Jan 2007 14:33:27 GMT; path=/
>   Set-Cookie: exp_tracker=a%3A1%3A%7Bi%3A0%3Bs%3A15%3A%22%2FITW%2Fitw-rss20%2F%22%3B%7D; path=/
>   Last-Modified: Mon, 16 Jan 2006 12:40:42 GMT
>   Content-Type: text/xml; charset=iso-8859-1;
>   X-Cache: MISS from proxy2
>   X-Cache-Lookup: MISS from proxy2:8080
>   Connection: keep-alive
> Length: unspecified [text/xml]
> 200 OK

And going with that, the feed's XML declaration is nicely claiming the same:

> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | head -1
> <?xml version="1.0" encoding="iso-8859-1"?>

At first glance it also looks like a valid claim, with special characters nicely encoded as XML character references. The ones I found with:

> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '&#'

are all punctuation chars that seem to
be correctly applied.

> http://www.netzpolitik.org/feed/
>

This one also has ISO-8859-1 encoding according to the HTTP header and XML declaration, so both seem OK.

> so, I guess that somewhere along the line from generating to
> serializing these feeds are messed with in a way that the encoding set
> in the serializers has no effect whatsoever.
>
> suggestions as to where this could be, anyone?
>

I have never used coplets, nor even looked at them (deeply sorry), but I would certainly check the way these feeds are interpreted in the first place (rather than how they are serialized). If that is bad, then nothing further on in the pipe will be able to produce decent character streams, regardless of the encoding schemes you're trying out on the serializer.

So, what do you do exactly, and what is the end result you see?

Do you often see uppercase characters (often A) with strange accents? Those are mostly an indication that valid UTF-8 was read as if it were Latin-1.

The opposite would result in invalid characters, often visualized as rectangular boxes; in the stream they should be indicated by some Unicode char in the upper regions (I forgot the exact one; U+FFxx range somewhere).

> it would be greatly appreciated :)
>
> regards, christian
>
> 2006/1/16, Ard Schrijvers <[EMAIL PROTECTED]>:
>
>>Think you should have no problem at all when you just serialize everything as
>>utf-8:
>>
>><map:serializer logger="sitemap.serializer.xml" mime-type="text/xml"
>>name="xml" pool-grow="4" pool-max="32" pool-min="4"
>>src="org.apache.cocoon.serialization.XMLSerializer">
>><encoding>UTF-8</encoding>
>></map:serializer>
>>

On the side: you don't need to set your serializer-specific encoding if you have set the form-encoding init param in the web.xml to utf-8 (which I would suggest at all times).

regards,
-marc=
--
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
[EMAIL PROTECTED]                              [EMAIL PROTECTED]
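For reference, the two misreading symptoms described in the reply above can be reproduced with a small Python sketch. The sample word "Grüße" is my own choice, not taken from either feed, and I am assuming the "U+FFxx" character meant above is U+FFFD, the Unicode REPLACEMENT CHARACTER that decoders commonly substitute for undecodable bytes.

```python
# Illustrative sketch: what each kind of misreading looks like.
text = "Gr\u00fc\u00dfe"  # "Grüße", a hypothetical sample string

# Case 1: valid UTF-8 bytes wrongly decoded as Latin-1.
# Every byte is a legal Latin-1 character, so nothing fails; each two-byte
# UTF-8 sequence simply turns into two characters, the first one being
# an accented uppercase A (U+00C3).
utf8_as_latin1 = text.encode("utf-8").decode("iso-8859-1")
print(repr(utf8_as_latin1))

# Case 2: Latin-1 bytes wrongly decoded as UTF-8.
# The bytes FC and DF do not form valid UTF-8 here, so a strict decoder
# raises an error ...
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ... and a lenient decoder substitutes U+FFFD, which many renderers
# display as a box or a question mark in a diamond.
latin1_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(repr(latin1_as_utf8))
```

Comparing the output you actually see against these two patterns is a quick way to tell in which direction the bytes were misinterpreted before they reached the serializer.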
