christian bindeballe wrote:
> Hello Marc,
>
> Marc Portier schrieb:
>
> <snip />
>
> OK, so I believe I got something wrong. These characters that I thought
> to be Unicode-Characters are rather XML-Interpretations?

Regarding unicode and encodings, please read this:
http://www.joelonsoftware.com/articles/Unicode.html

My shortlist:
- avoid using the word 'character', since it often leads to other
  interpretations than the one you are intending
- use 'glyph' or 'symbol' instead to indicate the typographic idiom
  people know and write down
- understand that the main job of the unicode standard is to assign
  so-called code-points to just about all glyphs that exist out there.
  Between humans these code-points are interchanged in a textual format
  that starts with U+ followed by (at least) 4 hexadecimal digits
- between computers these code-points are interchanged as byte-sequences;
  how to map code-points to byte-sequences is regulated by the encoding
- there is more than one encoding to choose from: the most commonly known
  are iso-8859-1 (latin-1), cp1252, utf-16 and utf-8. In other words, the
  same code-point/glyph can be interchanged as totally different
  byte-sequences
- latin-1 is a single-byte encoding and doesn't have room for all glyphs
  in the unicode list... code-points for which it has no byte simply
  cannot be represented and get replaced by a substitute byte when you
  encode
- utf-8 is a variable-width encoding where, depending on the code-point,
  the encoding results in a byte-sequence of one up to (typically) three
  (but I thought officially up to six) bytes
- since an exchanged text-file on disk (cd/usb) or over the net is just a
  bunch of bytes, it is in fact (theoretically) unreadable if you don't
  know the applied encoding
- xml files allow you to specify the encoding of the file itself in the
  xml declaration (first line of the file, and thus already in a certain
  encoding :-)). There is indeed a chicken-and-egg problem there, and a
  possible mismatch leading to parser failures if the file-encoding
  doesn't match the declared one
- xml files also allow the use of so-called character entities to
  communicate glyphs. Typically they are only used to communicate those
  glyphs that don't have a valid byte-sequence in the current encoding.
  These entities follow either of these patterns:
  &#(codepoint-in-decimal); or &#x(codepoint-in-hexadecimal);
- these entities are resolved (just like &gt; &lt; &apos; &quot; and
  &amp;) by your parser; in other words, in the regular XML APIs, SAX or
  DOM, you will no longer find any reference to them, they get replaced
  by their actual glyph-representation in the programming language of
  your choice (which in Java actually is utf-16)
- these entities are automatically and smartly inserted by the xalan
  serializers, depending on the encoding you force them to

> There are often Chars like &#8221; in the feeds. Since these aren't
> translated properly and they are not part of Latin-1 I thought they must
> be UTF-8, which they obviously aren't, or are they?
>

no, utf-8 is nowhere in sight here

these sequences are, on file-level, genuine valid iso-8859-1
byte-sequences that make up the glyph-sequence &#8221; which only on XML
level is recognised as a 'character entity' and thus interpreted as
something to be replaced by one single glyph (here: the right double
quotation mark, U+201D)

so the question remains: what do you mean by 'not translated correctly'?

Note that a final element in this whole discussion is the font you are
using: sometimes simple system-fonts don't have a valid
glyph-representation available for a perfectly legal communicated
code-point...
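To make the shortlist a bit more tangible, here is a minimal standalone
Java sketch (nothing Cocoon-specific, and the class name is just made up
for the example) showing one and the same code-point travelling as
different byte-sequences depending on the encoding, and as a character
entity when the encoding has no byte for it:

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// minimal sketch, not Cocoon code: the same code-point as different byte-sequences
public class CodepointDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+201D RIGHT DOUBLE QUOTATION MARK, the glyph behind &#8221;
        String glyph = "\u201D";

        // utf-8: this single code-point becomes a three-byte sequence (0xE2 0x80 0x9D)
        System.out.println("utf-8   : " + Arrays.toString(glyph.getBytes("UTF-8")));

        // iso-8859-1 has no byte for U+201D, so the encoder falls back to a
        // substitute byte ('?', 0x3F) -- the glyph is simply lost
        System.out.println("latin-1 : " + Arrays.toString(glyph.getBytes("ISO-8859-1")));

        // a serializer forced to iso-8859-1 avoids that loss by writing the
        // character entity instead: seven plain ascii bytes
        System.out.println("entity  : &#" + glyph.codePointAt(0) + ";");
    }
}

the three-byte utf-8 sequence, the lossy latin-1 substitute and the
seven-ascii-byte entity all communicate the very same U+201D.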
so you are trying to solve things completely at the wrong end :-(

>>> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '&#'
>>
>> are all punctuation chars that seem to be correctly applied
>
> see above :) you're more than probably right
>

thx for your confidence :-)
http://www.unicode.org/charts/PDF/U2000.pdf

>> I have never used coplets, nor even looked at them (deeply sorry)
>> but I would certainly check the way these feeds are interpreted in the
>> first place (rather than how they are serialized)
>>
>> if that is bad, then nothing further on in the pipe will be able to
>> produce decent character streams regardless of the encoding schemes
>> you're trying out on the serializer
>
> This is the relevant part of my sitemap:
> <map:match pattern="live.rss">

so this url will actually look like
http://yourserver/cocoon/submap/live.rss?feed=http://whatever.de/some.rss
right?

>   <map:generate type="file" src="{request-param:feed}"
>                 label="content" />

this will read the mentioned feed and parse it; since the feeds are ok
regarding encoding and character entities, I suspect all things would be
ok up to here

>   <map:transform type="xslt" src="styles/rss2html.xsl">
>     <map:parameter name="fullscreen"
>                    value="{coplet:aspectDatas/fullScreen}"/>
>   </map:transform>
>   <map:serialize type="xml"/>

odd, your stylesheet claims in its name to be targeting html, yet you
serialize as xml; just for debugging maybe?

> </map:match>
>
> So my next thought was that it is the XSL that is messing up the RSS.

nope, as mentioned earlier: all levels above the parser (so also the xslt
engine) know nothing about encodings or character entities, they deal
directly with the actual code-points, if you like

> So I edited the XSL and added this line after the <xsl:stylesheet>
>
> <xsl:output method="html" encoding="ISO-8859-1"/>
>

that will not help: any xsl:output directive is overridden when used in a
cocoon pipeline

normal xsl in fact allows you to specify both the transformation aspect
and the serialization aspect. This makes sense since xsl engines are
often used in an xml-file to xml-file manner.

inside cocoon however xslt is always applied in a SAX to SAX manner, i.e.
doing only the transform step, and thus all serialization-specific
aspects (the xsl:output element) are ignored

the good news here: you can use xsl:output in an optimal way for your
xslt authoring/testing/debugging cycle without fear that those settings
will affect the cocoon pipeline operation

> but it didn't help either. Maybe someone would like to take a look at
> the xsl I attached to see whether there is something wrong with it?
>

hard to imagine; you would actually need something like replace()
functions to force replacing of valid characters with any of the unicode
code-points for unrepresentable characters at the unicode level (which
exist by the way: http://www.unicode.org/charts/PDF/U2000.pdf, to allow
decoders to flag unreadable byte-sequences)

all other characters remain visible on the unicode level and should, upon
re-serialization (even to iso-8859-1), just be replaced by decent xml
character entities

afaics that is the only other thing you can check now: you might apply
the checks I used on the feeds directly on the end result of your
system... check the HTTP header, the xml declaration, and scan for the
parts where you know the 'funny chars' occur...
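to make that concrete, a rough standalone Java sketch of such a check
(the url is just the placeholder from above, and the class name and the
simple charset fallback are assumptions for illustration); it prints the
Content-Type header, the xml declaration and every line that still
carries a character entity:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// minimal sketch: fetch the pipeline's end result and dump the bits that matter
public class CheckEndResult {
    public static void main(String[] args) throws Exception {
        // hypothetical url, adjust to your own setup
        URL url = new URL("http://yourserver/cocoon/submap/live.rss"
                + "?feed=http://whatever.de/some.rss");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        // 1) the HTTP header: which charset does the server claim?
        String contentType = con.getContentType();
        System.out.println("Content-Type: " + contentType);

        // decode the body with the claimed charset (simple fallback to iso-8859-1)
        String charset = "ISO-8859-1";
        if (contentType != null && contentType.indexOf("charset=") != -1) {
            charset = contentType.substring(
                    contentType.indexOf("charset=") + "charset=".length()).trim();
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), charset));

        // 2) the xml declaration (first line) and
        // 3) every line that still carries character entities
        String line;
        boolean first = true;
        while ((line = in.readLine()) != null) {
            if (first || line.indexOf("&#") != -1) {
                System.out.println(line);
            }
            first = false;
        }
        in.close();
    }
}

running it against both the original feed url and your live.rss url
should show you at which step the entities (or the bytes behind them)
change.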
if they enter your browser as legal character entities, then I'm afraid
you are looking at a font problem (hard to imagine in these modern times
though)

HTH anyways, I'm interested to hear what is actually going on over
there :-)

-marc

--
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                       http://blogs.cocoondev.org/mpo/
[EMAIL PROTECTED]                       [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
