How does document encoding affect DOMString character values in a resulting DOM? (possible Xerces bug)

F. Andy Seidl 7 May 2004 18:38:49 -0000

I am uncertain whether the behavior I am seeing in the Xerces DOM parser
(2.6.2) is correct.  Specifically, I am unsure which character values
should appear in a DOMString after parsing a document that uses a character
encoding such as ISO-8859-1 or Windows-1252.

Here is a specific example to illustrate the question:
Suppose a document that declares encoding="ISO-8859-1" contains the byte
value 0x93 in the text content of an element.  In Windows-1252, this byte is
the left double quotation mark (a "smart quote" in Windows terminology).  The
byte is also legal in ISO-8859-1.  However, the Unicode code point for LEFT
DOUBLE QUOTATION MARK is U+201C.

So, once this document is parsed into a DOM, should the DOM contain the
character value 0x93 or the Unicode value 0x201C?
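One way to see what each encoding assigns to the byte, independent of any XML parser, is to decode it directly. The following is a minimal Python sketch; Python and its standard codecs are used purely for illustration and are not part of Xerces.

```python
import unicodedata

raw = bytes([0x93])

# ISO-8859-1 maps every byte to the code point with the same numeric value.
latin1_char = raw.decode("iso-8859-1")

# Windows-1252 reassigns most of the 0x80-0x9F range to printable characters.
cp1252_char = raw.decode("windows-1252")

print(f"ISO-8859-1:   U+{ord(latin1_char):04X}")
print(f"Windows-1252: U+{ord(cp1252_char):04X} {unicodedata.name(cp1252_char)}")
```

The byte becomes U+201C only when it is decoded as Windows-1252, so the expected DOM value depends on which encoding the document actually declares.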
Based on the DOM Level 2 Core specification, it seems the DOM should contain
0x201C because the spec says, "Applications must encode DOMString using
UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])."
See http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578

However, after parsing the document with Xerces, the DOM contains the
character value 0x93 from the original source document (which, in Unicode,
is the SET TRANSMIT STATE control character, U+0093, and not a left double
quotation mark).
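For comparison, the observation can be checked against a different parser. This sketch uses Python's standard xml.dom.minidom, which is backed by Expat rather than Xerces, so it says nothing about Xerces internals. The helper name first_text_code_point is hypothetical. It parses the same one-element document under two different encoding declarations.

```python
from xml.dom import minidom


def first_text_code_point(encoding: str) -> int:
    """Parse <q>[byte 0x93]</q> declared with `encoding`; return the text's code point."""
    header = f'<?xml version="1.0" encoding="{encoding}"?>'.encode("ascii")
    doc_bytes = header + b"<q>\x93</q>"
    dom = minidom.parseString(doc_bytes)
    # The element holds a single text node containing one decoded character.
    return ord(dom.documentElement.firstChild.data)


latin1_cp = first_text_code_point("ISO-8859-1")
cp1252_cp = first_text_code_point("windows-1252")
print(f"declared ISO-8859-1:   U+{latin1_cp:04X}")
print(f"declared windows-1252: U+{cp1252_cp:04X}")
```

If an unrelated parser also reports U+0093 for the ISO-8859-1 declaration, that would suggest the value follows from the declared encoding and not from a Xerces-specific defect.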
Is this a Xerces bug?  If so, can anyone suggest where to look in the Xerces
source to start debugging?

Thanks,
  -- fas
F. Andy Seidl, Co-founder
MyST Technology Partners
Creators of MySmartChannels




