On Fri, Oct 17, 2003 at 04:09:31PM +0100, Greg Farrell wrote:
> The following text is correctly parsed in windows,
>
> <DATA>
> (?[ASSIGN,Found,TEXT,YES])|([ASSIGN,Found,TEXT,FALSE])
> </DATA>
>
> however in linux the cross of loraine (?) character is stripped, as
> is the data immediately after it. This also happens with xerces 2.3.
> Can anyone suggest a way around this problem? Or even better, a fix
> for it.

I think that the problem here is the character encoding. I am assuming
that your XML does not have an XML declaration. According to [1], in
the absence of a declaration specifying an encoding (and with no hint
provided by an external transport protocol), UTF-8 should be assumed.
Byte 134 (0x86) cannot start a UTF-8 sequence, because it is only
legal as a continuation byte inside a multi-byte sequence. Hence your
document is not well-formed.

To make it well-formed, you should do one of the following: add an XML
declaration stating that the text is in the Windows-1251 encoding,
represent the character correctly in UTF-8, or include it as a
character reference.

Adding the declaration would probably be the easiest, as it just
involves adding the line

<?xml version="1.0" encoding="windows-1251"?>

at the very beginning of your document (the version attribute is
required whenever an encoding is given). However, I am not sure that
Xerces-C 1.7 supports transcoding from the Windows codepage.

Correctly representing the character in UTF-8 might be possible,
depending on what you are using to create the document, but I am not
convinced this would be the easiest route. In that case the character
should be represented by the 3 bytes 0xe2, 0x98, 0xa8.

According to [2], the "cross of lorraine" is U+2628, so its character
reference is &#x2628; and your document would be well-formed were it

<DATA>
(&#x2628;[ASSIGN,Found,TEXT,YES])|([ASSIGN,Found,TEXT,FALSE])
</DATA>

I hope that this helps. I do find that a lot of problems with XML
documents are due to character set mismatches such as this.

David

[1] http://www.w3.org/TR/REC-xml#NT-EncodingDecl
[2] http://ppewww.ph.gla.ac.uk/~flavell/unicode/unidata26.html

-- 
David Sheldon, Client Services
DecisionSoft Ltd.
Telephone: +44-1865-203192
http://www.decisionsoft.com
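
P.S. The byte-level points above can be checked with a few lines of
Python. This is only a minimal sketch: Python and its bundled expat
parser are my choice for illustration and are not Xerces-C, so they
may not behave identically, and the helper names is_valid_utf8 and
parses are made up for this example. It checks that a lone byte 0x86
is rejected as UTF-8, shows the UTF-8 encoding of U+2628, and parses
the variant of the document that uses a character reference.

```python
import xml.etree.ElementTree as ET


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def parses(doc: bytes) -> bool:
    """Return True if Python's expat-based parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Byte 134 (0x86) is a continuation byte, so on its own it is not UTF-8.
lone_byte_ok = is_valid_utf8(b"\x86")

# U+2628 CROSS OF LORRAINE encoded as UTF-8.
cross_utf8 = "\u2628".encode("utf-8")

# No XML declaration, so UTF-8 is assumed and the raw byte is rejected.
raw_doc = b"<DATA>(\x86[ASSIGN,Found,TEXT,YES])</DATA>"
raw_doc_ok = parses(raw_doc)

# The same data with a character reference is pure ASCII and parses.
ref_doc = b"<DATA>(&#x2628;[ASSIGN,Found,TEXT,YES])</DATA>"
text = ET.fromstring(ref_doc).text

print("lone 0x86 valid UTF-8:", lone_byte_ok)
print("U+2628 as UTF-8 bytes:", cross_utf8.hex())
print("raw document parses:", raw_doc_ok)
print("parsed text:", ascii(text))
```

The character-reference form has the advantage that the file itself
stays plain ASCII, so it does not depend on which encodings a given
parser build can transcode.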