I am uncertain whether the behavior I am seeing in the Xerces DOM parser (2.6.2) is correct. Specifically, I am unclear as to what character values should appear in a DOM string after parsing a document that uses a character encoding such as ISO-8859-1 or Windows-1252.

Here is a specific example to illustrate the question:

Suppose a document that specifies encoding="ISO-8859-1" contains a byte value 0x93 as part of the text content of an element. This is a double left quote character (a "smart quote" in Windows terminology). This is a legal character for the encoding. However, the Unicode index for LEFT DOUBLE QUOTATION MARK is 0x201C.

So, once this document is parsed into a DOM, should the DOM contain the character value 0x93 or the Unicode value 0x201C?

Based on the DOM Level 2 Core specification, it seems the DOM should contain 0x201C because the spec says, "Applications must encode DOMString using UTF-16 (defined in [Unicode] and Amendment 1 of [ISO/IEC 10646])."

See http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578

However, after parsing the document with Xerces, the DOM contains the character value 0x93 from the original source document (which, in Unicode, is a "set transmit state" control character and not a left double quote).

Is this a Xerces bug? If so, can anyone offer advice as to where to look in the Xerces source to start debugging?

Thanks,
-- fas

F. Andy Seidl, Co-founder
MyST Technology Partners
Creators of MySmartChannels
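
P.S. For anyone who wants to reproduce the observation, here is a minimal sketch. It is an assumption-laden stand-in, not the original test case: it uses whatever JAXP parser the JDK provides (not necessarily Xerces 2.6.2), and the class name Byte93Check and method parsedCodePoint are hypothetical names chosen for illustration. It builds a tiny document whose only text content is the single byte 0x93, parses it under a given declared encoding, and reports the code point that ends up in the DOM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Byte93Check {

    // Hypothetical helper: parses <t>[byte 0x93]</t> with the given declared
    // encoding and returns the code point of the resulting DOM text.
    static int parsedCodePoint(String encoding) throws Exception {
        byte[] head = ("<?xml version=\"1.0\" encoding=\"" + encoding + "\"?><t>")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</t>".getBytes(StandardCharsets.US_ASCII);

        // Assemble the raw bytes by hand so 0x93 is written as a single byte,
        // not re-encoded through a Java String.
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        doc.write(head);
        doc.write(0x93);
        doc.write(tail);

        // Assumption: the default JAXP parser, not Xerces 2.6.2 specifically.
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(new ByteArrayInputStream(doc.toByteArray()));

        return dom.getDocumentElement().getTextContent().codePointAt(0);
    }

    public static void main(String[] args) throws Exception {
        // Compare the two encodings mentioned above.
        for (String enc : new String[] {"ISO-8859-1", "windows-1252"}) {
            System.out.printf("%s -> U+%04X%n", enc, parsedCodePoint(enc));
        }
    }
}
```

Running it with each declared encoding in turn makes it easy to see whether the value in the DOM depends on the encoding declaration or on the parser.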