christian bindeballe wrote:
> Hello Marc,
>
> Marc Portier schrieb:
>
> <snip />
>
> OK, so I believe I got something wrong. These characters that I thought
> to be Unicode-Characters are rather XML-Interpretations?

Regarding unicode and encodings, please read this:
http://www.joelonsoftware.com/articles/Unicode.html

My shortlist:
- avoid using the word 'character', since it often leads to other
  interpretations than the one you are intending
- use 'glyph' or 'symbol' instead to indicate the typographic idiom
  people know and write down
- understand that the main job of the unicode standard is to assign
  so-called code-points to just about all glyphs that exist out there.
  Between humans these code-points are interchanged in a textual format
  that starts with U+ followed by (at least) 4 hexadecimal digits
- between computers these code-points are interchanged as byte-sequences;
  how to map code-points to byte-sequences is regulated by the encoding
- there is more than one encoding to choose from: the most commonly known
  are iso-8859-1 (latin-1), cp1252, utf-16 and utf-8. In other words, the
  same code-point/glyph can be interchanged as totally different
  byte-sequences
- latin-1 is a single-byte encoding and doesn't have room for all glyphs
  in the unicode list... code-points for which it has no byte simply
  cannot be represented and get replaced by a substitute byte when you
  encode
- utf-8 is a variable-width encoding where, depending on the code-point,
  the encoding results in a byte-sequence of one up to (typically) three
  (but I thought officially up to six) bytes
- since an exchanged text-file on disk (cd/usb) or over the net is just a
  bunch of bytes, it is in fact (theoretically) unreadable if you don't
  know the applied encoding
- xml files allow you to specify the encoding of the file itself in the
  xml declaration (first line of the file, and thus already in a certain
  encoding :-)). There is indeed a chicken-and-egg problem there, and a
  possible mismatch leading to parser failures if the file-encoding
  doesn't match the declared one
- xml files also allow the use of so-called character entities to
  communicate glyphs. Typically they are only used to communicate those
  glyphs that don't have a valid byte-sequence in the current encoding.
  These entities follow either of these patterns:
  &#(codepoint-in-decimal); or &#x(codepoint-in-hexadecimal);
- these entities are resolved (just like &gt; &lt; &apos; &quot; and
  &amp;) by your parser; in other words, in the regular XML APIs, SAX or
  DOM, you will no longer find any reference to them, they get replaced
  by their actual glyph-representation in the programming language of
  your choice (which in Java actually is utf-16)
- these entities are automatically and smartly inserted by the xalan
  serializers, depending on the encoding you force them to

> There are often Chars like &#8221; in the feeds. Since these aren't
> translated properly and they are not part of Latin-1 I thought they must
> be UTF-8, which they obviously aren't, or are they?
>

no, utf-8 is nowhere in sight here

these sequences are, on file-level, genuine valid iso-8859-1
byte-sequences that make up the glyph-sequence &#8221; which only on XML
level is recognised as a 'character entity' and thus interpreted as
something to be replaced by one single glyph (here: the right double
quotation mark, U+201D)

so the question remains: what do you mean by 'not translated correctly'?

Note that a final element in this whole discussion is the font you are
using: sometimes simple system-fonts don't have a valid
glyph-representation available for a perfectly legal communicated
code-point...
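To make the shortlist a bit more tangible, here is a minimal standalone
Java sketch (nothing Cocoon-specific, and the class name is just made up
for the example) showing one and the same code-point travelling as
different byte-sequences depending on the encoding, and as a character
entity when the encoding has no byte for it:

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

// minimal sketch, not Cocoon code: the same code-point as different byte-sequences
public class CodepointDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+201D RIGHT DOUBLE QUOTATION MARK, the glyph behind &#8221;
        String glyph = "\u201D";

        // utf-8: this single code-point becomes a three-byte sequence (0xE2 0x80 0x9D)
        System.out.println("utf-8   : " + Arrays.toString(glyph.getBytes("UTF-8")));

        // iso-8859-1 has no byte for U+201D, so the encoder falls back to a
        // substitute byte ('?', 0x3F) -- the glyph is simply lost
        System.out.println("latin-1 : " + Arrays.toString(glyph.getBytes("ISO-8859-1")));

        // a serializer forced to iso-8859-1 avoids that loss by writing the
        // character entity instead: seven plain ascii bytes
        System.out.println("entity  : &#" + glyph.codePointAt(0) + ";");
    }
}

the three-byte utf-8 sequence, the lossy latin-1 substitute and the
seven-ascii-byte entity all communicate the very same U+201D.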
so you are trying to solve things completely at the wrong end :-(

>>> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '&#'
>>
>> are all punctuation chars that seem to be correctly applied
>
> see above :) you're more than probably right
>

thx for your confidence :-)
http://www.unicode.org/charts/PDF/U2000.pdf

>> I have never used coplets, nor even looked at them (deeply sorry)
>> but I would certainly check the way these feeds are interpreted in the
>> first place (rather than how they are serialized)
>>
>> if that is bad, then nothing further on in the pipe will be able to
>> produce decent character streams regardless of the encoding schemes
>> you're trying out on the serializer
>
> This is the relevant part of my sitemap:
> <map:match pattern="live.rss">

so this url will actually look like
http://yourserver/cocoon/submap/live.rss?feed=http://whatever.de/some.rss
right?

>   <map:generate type="file" src="{request-param:feed}"
>                 label="content" />

this will read the mentioned feed and parse it; since the feeds are ok
regarding encoding and character entities, I suspect all things would be
ok up to here

>   <map:transform type="xslt" src="styles/rss2html.xsl">
>     <map:parameter name="fullscreen"
>                    value="{coplet:aspectDatas/fullScreen}"/>
>   </map:transform>
>   <map:serialize type="xml"/>

odd, your stylesheet claims in its name to be targeting html, yet you
serialize as xml; just for debugging maybe?

> </map:match>
>
> So my next thought was that it is the XSL that is messing up the RSS.

nope, as mentioned earlier: all levels above the parser (so also the xslt
engine) know nothing about encodings or character entities, they deal
directly with the actual code-points, if you like

> So I edited the XSL and added this line after the <xsl:stylesheet>
>
> <xsl:output method="html" encoding="ISO-8859-1"/>
>

that will not help: any xsl:output directive is overridden when used in a
cocoon pipeline

normal xsl in fact allows you to specify both the transformation aspect
and the serialization aspect. This makes sense since xsl engines are
often used in an xml-file to xml-file manner.

inside cocoon however xslt is always applied in a SAX to SAX manner, i.e.
doing only the transform step, and thus all serialization-specific
aspects (the xsl:output element) are ignored

the good news here: you can use xsl:output in an optimal way for your
xslt authoring/testing/debugging cycle without fear that those settings
will affect the cocoon pipeline operation

> but it didn't help either. Maybe someone would like to take a look at
> the xsl I attached to see whether there is something wrong with it?
>

hard to imagine; you would actually need something like replace()
functions to force replacing of valid characters with any of the unicode
code-points for unrepresentable characters at the unicode level (which
exist by the way: http://www.unicode.org/charts/PDF/U2000.pdf, to allow
decoders to flag unreadable byte-sequences)

all other characters remain visible on the unicode level and should, upon
re-serialization (even to iso-8859-1), just be replaced by decent xml
character entities

afaics that is the only other thing you can check now: you might apply
the checks I used on the feeds directly on the end result of your
system... check the HTTP header, the xml declaration, and scan for the
parts where you know the 'funny chars' occur...
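to make that concrete, a rough standalone Java sketch of such a check
(the url is just the placeholder from above, and the class name and the
simple charset fallback are assumptions for illustration); it prints the
Content-Type header, the xml declaration and every line that still
carries a character entity:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// minimal sketch: fetch the pipeline's end result and dump the bits that matter
public class CheckEndResult {
    public static void main(String[] args) throws Exception {
        // hypothetical url, adjust to your own setup
        URL url = new URL("http://yourserver/cocoon/submap/live.rss"
                + "?feed=http://whatever.de/some.rss");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        // 1) the HTTP header: which charset does the server claim?
        String contentType = con.getContentType();
        System.out.println("Content-Type: " + contentType);

        // decode the body with the claimed charset (simple fallback to iso-8859-1)
        String charset = "ISO-8859-1";
        if (contentType != null && contentType.indexOf("charset=") != -1) {
            charset = contentType.substring(
                    contentType.indexOf("charset=") + "charset=".length()).trim();
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), charset));

        // 2) the xml declaration (first line) and
        // 3) every line that still carries character entities
        String line;
        boolean first = true;
        while ((line = in.readLine()) != null) {
            if (first || line.indexOf("&#") != -1) {
                System.out.println(line);
            }
            first = false;
        }
        in.close();
    }
}

running it against both the original feed url and your live.rss url
should show you at which step the entities (or the bytes behind them)
change.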
if they enter your browser as legal character entities, then I'm afraid
you are looking at a font problem (hard to imagine in these modern times
though)

HTH anyways, I'm interested to hear what is actually going on over
there :-)

-marc

--
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                       http://blogs.cocoondev.org/mpo/
[EMAIL PROTECTED]                       [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
