christian b wrote:
> Thank you, Ard
> 
> I already did that. But it doesn't change anything. I found this in my
> web.xml in the WEB-INF folder of my cocoon-build:
> 
> <!--
>       Set encoding used by the container. If not set the ISO-8859-1 encoding
>       will be assumed.
>       Since the servlet specification requires that the ISO-8859-1 encoding
>       is used (by default), you should never change this value unless
>       you have a buggy servlet container.
>     -->
>     <init-param>
>       <param-name>container-encoding</param-name>
>       <param-value>ISO-8859-1</param-value>
>     </init-param>
> 
> Servlet-Container used is Tomcat 5.0.28
> 
> I switched the encoding parameter to UTF-8 to check whether it would
> work, and it seems to. But still the coplets aren't encoded properly.
> 

never change your container-encoding unless your servlet container lets
you specify which encoding it applies when decoding URLs and request
parameters

(if you don't understand what I just said: that translates to simply
"never")


e.g. when you use jetty (the only one I know of) you can specify a system
property -Dorg.mortbay.util.URI.charset=utf-8

only then should the cocoon servlet init param be changed to match that


> Then I saw that the whole page is encoded in ISO-8859-1, having been
> serialized in HTML (as seen in the doctype of the page). So I looked
> for the HTML-Serializer in my portal/sitemap.xmap and changed the
> encoding of the html-serializer, too. no difference
> 
> these are the feed-adresses that I want to incorporate. both don't
> have an encoding set (do RSS-feeds have to have that?) but they
> clearly contain UTF-8 encoded characters.
> 

like where? I just did a rough scan but couldn't find any 'multiple bytes
for a single character' occurrences

note that many 'at first glance odd' characters DO have a valid position
in the ISO-8859-1 encoding

e.g. U+00DF, the typical German LATIN SMALL LETTER SHARP S (Eszett), is
just encoded as the single byte hex DF in latin 1

just because a certain character requires 2 bytes in UTF-8 encoding
doesn't mean that this character _IS_ a UTF-8 encoded char: the same
character might very well have a valid and useful single byte latin 1
encoding.

(in other words: the 'encoding' is never a property of the glyph, but I
admit: yeah, some glyphs don't have representations in all encodings)
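
a small sketch of that point, plus the kind of rough scan I mean. the
helper has_utf8_multibyte is a name invented here, and the sample byte
strings are made up, not taken from the feeds.

```python
sharp_s = "\u00df"                            # LATIN SMALL LETTER SHARP S

latin1_bytes = sharp_s.encode("iso-8859-1")   # one byte: DF
utf8_bytes = sharp_s.encode("utf-8")          # two bytes: C3 9F

# some characters have no latin 1 representation at all, e.g. the euro sign
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False


def has_utf8_multibyte(data: bytes) -> bool:
    """Rough scan: True if data is valid UTF-8 and contains non-ASCII chars."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False                          # not valid UTF-8 at all
    return any(ord(ch) > 0x7F for ch in text)


latin1_sample = b"Stra\xdfe"                  # sharp s as a single latin 1 byte
utf8_sample = b"Stra\xc3\x9fe"                # same word, UTF-8 encoded
```

has_utf8_multibyte(utf8_sample) is True, has_utf8_multibyte(latin1_sample)
is False: the same character, two different byte representations.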


> http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/

the http header states that the file is iso-8859-1 encoded:
(see the content-type header)

> 
> $ wget -S --spider http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
> --15:33:26--  http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
>            => `index.html'
> Resolving www.industrial-technology-and-witchcraft.de... 212.227.64.59
> Connecting to www.industrial-technology-and-witchcraft.de|212.227.64.59|:80... connected.
> HTTP request sent, awaiting response...
>   HTTP/1.0 200 OK
>   Date: Mon, 16 Jan 2006 14:33:26 GMT
>   Server: Apache/1.3.33 (Unix)
>   Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
>   Expires: Mon, 16 Jan 2006 13:40:42 GMT
>   Pragma: no-cache
>   X-Powered-By: PHP/4.4.1
>   Set-Cookie: exp_last_visit=822058407; expires=Tue, 16 Jan 2007 14:33:27 GMT; path=/
>   Set-Cookie: exp_last_activity=1137418407; expires=Tue, 16 Jan 2007 14:33:27 GMT; path=/
>   Set-Cookie: exp_tracker=a%3A1%3A%7Bi%3A0%3Bs%3A15%3A%22%2FITW%2Fitw-rss20%2F%22%3B%7D; path=/
>   Last-Modified: Mon, 16 Jan 2006 12:40:42 GMT
>   Content-Type: text/xml; charset=iso-8859-1;
>   X-Cache: MISS from proxy2
>   X-Cache-Lookup: MISS from proxy2:8080
>   Connection: keep-alive
> Length: unspecified [text/xml]
> 200 OK

and in line with that, the feed's xml declaration nicely claims the same:

> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | head -1
> <?xml version="1.0" encoding="iso-8859-1"?>

at first glance it also looks like a valid claim, with special
characters nicely encoded as XML entities

the ones I found with:
> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '&#'

are all punctuation chars that seem to be correctly applied
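
why those '&#' references are safe: a numeric character reference is
plain ASCII in the file, so it survives whatever encoding the file claims.
a minimal sketch; the sample string is made up and is not copied from the
feed.

```python
import html

# hypothetical feed text using numeric character references for punctuation
sample = "&#8220;quoted&#8221; &#8211; dash"

# the raw text is pure ASCII, so this cannot fail
as_ascii = sample.encode("ascii")

# resolving the references yields the real punctuation characters
decoded = html.unescape(sample)
```

decoded then holds U+201C, U+201D and U+2013, none of which exist in latin 1
as single bytes, yet the file itself stays valid iso-8859-1.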


> http://www.netzpolitik.org/feed/
> 

this one also has ISO-8859-1 encoding according to http header and xml
declaration

so both seem ok
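
the check I did by hand can be sketched like this. the two function names
are invented for the example; the header value and the declaration are the
ones shown in the wget output above.

```python
import re


def charset_from_content_type(header):
    """Return the charset parameter of a Content-Type value, or None."""
    m = re.search(r"charset\s*=\s*\"?([\w.:-]+)", header, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_xml_declaration(first_line):
    """Return the encoding pseudo-attribute of an xml declaration, or None."""
    m = re.match(r"<\?xml[^>]*\bencoding\s*=\s*[\"']([\w.:-]+)[\"']", first_line)
    return m.group(1).lower() if m else None


header = "text/xml; charset=iso-8859-1;"
declaration = '<?xml version="1.0" encoding="iso-8859-1"?>'

# the two claims agree, which is what "seems ok" means here
consistent = (charset_from_content_type(header)
              == charset_from_xml_declaration(declaration))
```

note this only compares the two claims with each other; it says nothing
about whether the bytes in the body actually follow them.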

> so, I guess that somewhere along the line from generating to
> serializing these feeds are messed with in a way that the encoding set
> in the serializers has no effect whatsoever.
> 
> suggestions as to where this could be, anyone?
> 

I have never used coplets, nor even looked at them (deeply sorry)
but I would certainly check the way these feeds are interpreted in the
first place (rather than how they are serialized)

if that goes wrong, then nothing further on in the pipe will be able to
produce decent character streams, regardless of the encoding schemes
you're trying out on the serializer
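
a sketch of what I mean, with a made-up sample and a made-up helper
serialize_and_view standing in for "serializer writes bytes, browser reads
them with the same encoding":

```python
# bytes as they might arrive from a feed that really is UTF-8
feed_bytes = "Gr\u00fc\u00dfe".encode("utf-8")

# the reading side wrongly assumes latin 1: the damage is done here
misread = feed_bytes.decode("iso-8859-1")


def serialize_and_view(text, encoding):
    """Encode text as a serializer would, then decode it as a client would."""
    return text.encode(encoding).decode(encoding)


# whatever encoding the serializer uses, the garbled text comes out garbled
as_utf8 = serialize_and_view(misread, "utf-8")
as_latin1 = serialize_and_view(misread, "iso-8859-1")

# the only real fix is to decode correctly at the start of the pipe
read_correctly = feed_bytes.decode("utf-8")
```

both as_utf8 and as_latin1 equal the misread string, so changing the
serializer encoding cannot repair it.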



so, what do you do exactly, and what is the end result you see?

do you often see uppercase characters (often A) with strange accents?
those mostly indicate that valid utf-8 was read as if it were latin-1

the opposite would result in invalid characters, often visualized as
rectangular boxes; in the stream they should show up as the unicode
replacement character U+FFFD
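
as a rough aid for telling the two cases apart, something like the
following could be run on the text you end up with. it is only a
heuristic, and the function name symptom is invented for this example.

```python
def symptom(text):
    """Guess which decoding mistake produced the text (heuristic only)."""
    if "\ufffd" in text:
        # the decoder hit bytes that are invalid UTF-8 and substituted U+FFFD
        return "latin-1 bytes were read as utf-8"
    if "\u00c3" in text or "\u00c2" in text:
        # A-tilde / A-circumflex are the latin 1 view of common UTF-8 lead bytes
        return "utf-8 bytes were probably read as latin-1"
    return "no obvious symptom"


original = "Gr\u00fc\u00dfe"

# valid utf-8 read as latin-1: accented uppercase A characters appear
case_a = original.encode("utf-8").decode("iso-8859-1")

# latin-1 read as utf-8: invalid sequences become U+FFFD
case_b = original.encode("iso-8859-1").decode("utf-8", errors="replace")
```

text that legitimately contains an accented A would of course trigger the
second branch too, so treat the answer as a hint, not as proof.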


> it would be greatly appreciated :)
> 
> regards, christian
> 
> 2006/1/16, Ard Schrijvers <[EMAIL PROTECTED]>:
> 
>>Think you should have no problem at all when you just serialize everything as 
>>utf-8:
>>
>><map:serializer logger="sitemap.serializer.xml" mime-type="text/xml" 
>>name="xml" pool-grow="4" pool-max="32" pool-min="4" 
>>src="org.apache.cocoon.serialization.XMLSerializer">
>><encoding>UTF-8</encoding>
>></map:serializer>
>>

as an aside: you don't need to set a serializer-specific encoding if
you have set the form-encoding init param in the web.xml to utf-8 (which
I would suggest at all times)

regards,
-marc=
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
[EMAIL PROTECTED]                              [EMAIL PROTECTED]
