RE: Invalid XML characters?

Holliday, Donald B. (LNG-CSP) 2 Oct 2003 14:08:04 -0000

Most relational databases store bytes, not characters.  Oracle, for example,
is perfectly happy storing control characters and Latin-1 accented
characters even if you define the database character set as US7ASCII.  You
have to be careful, though, because they will also store multi-byte
characters as single-byte garbage.  Our database storage routines always
send the data through a method that "fixes up" the data, converting it to
acceptable ASCII values before handing it to Oracle.
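
The fix-up routine itself is not shown in this thread. As a rough sketch of
what such a step could look like, assuming accented letters should lose their
accents and anything else outside printable ASCII should become '?', one
might write the following. The class name AsciiFixup, the method toAscii, and
the '?' replacement are all hypothetical choices, and java.text.Normalizer
requires Java 6 or later.

```java
import java.text.Normalizer;

// Hypothetical helper: not the routine described above, only an illustration.
final class AsciiFixup {
    private AsciiFixup() {}

    /** Reduces a string to tab, LF, CR and printable ASCII (0x20-0x7E). */
    static String toAscii(String s) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue; // drop the combining accent, keep the base letter
            }
            boolean keep = cp == '\t' || cp == '\n' || cp == '\r'
                    || (cp >= 0x20 && cp <= 0x7E);
            if (keep) {
                out.appendCodePoint(cp);
            } else {
                out.append('?'); // assumed placeholder for unrepresentable data
            }
        }
        return out.toString();
    }
}
```

A replacement character keeps field lengths predictable; dropping the
offending characters instead would be an equally reasonable policy.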

Thanks,

Donald Holliday

-----Original Message-----
From: Tom Sugden [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 02, 2003 4:45 AM
To: [EMAIL PROTECTED]
Subject: RE: Invalid XML characters?

Thanks, Donald. It is evident that some relational databases allow this
character within text fields (such as dBASE IV on Win2000), so I suppose
I'll have to filter these values before encoding to XML, or else sanitize
the database beforehand. - Tom

-----Original Message-----
From: Holliday, Donald B. (LNG-CSP)
[mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 01, 2003 4:49 PM
To: '[EMAIL PROTECTED]'
Subject: RE: Invalid XML characters?

Valid content for CDATA is

CData ::= (Char* - (Char* ']]>' Char*))

Char  ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

0xC is not in this list.

See http://www.w3.org/TR/REC-xml#sec-cdata-sect
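
The Char production translates directly into a range check, which can then
be used to filter data before it is written into an XML document. A minimal
sketch follows; the class name XmlCharFilter and its method names are made up
for illustration, and the code point methods on String and StringBuilder
require Java 5 or later.

```java
// Hypothetical helper illustrating the XML 1.0 Char production.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside Char removed. */
    static String stripInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate comes back as its own value
            // (0xD800-0xDFFF), which isXmlChar rejects.
            int cp = s.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, XmlCharFilter.stripInvalid("a\fb") drops the form feed (0xC)
and returns "ab". Whether to drop, replace, or escape such characters is an
application decision; this sketch simply drops them.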

Thanks,

Donald Holliday
(719) 481-7501            V
[EMAIL PROTECTED]

-----Original Message-----
From: Tom Sugden [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 01, 2003 8:58 AM
To: [EMAIL PROTECTED]
Subject: Invalid XML characters?

Hello,

I was wondering whether anyone could clarify something for me. I've noticed
some behaviour with the Xerces SAX parser (version 2.4.0 according to jar
manifest file) that may constitute a bug. When attempting to parse some XML
character data that contains an unusual character (Unicode 0xC) wrapped in a
CDATA section, the parser throws an org.xml.sax.SAXParseException.

The XML specification seems to indicate that valid character data is any
Unicode character, excluding the surrogate blocks, FFFE, and FFFF. Since 0xC
is neither within the surrogate blocks nor equivalent to 0xFFFE or 0xFFFF, I
was surprised by this exception. I wrote a small test program to try parsing
a series of documents containing each possible Unicode character within a
CDATA section, excluding the surrogate blocks and FFFE and FFFF. This seemed
to identify a further 151 characters that would cause either an
org.xml.sax.SAXParseException or a java.io.UTFDataFormatException to be
raised.
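
The test program is not included in this message. A probe along the lines
described might look like the sketch below, which uses the JAXP SAX parser
shipped with the JDK rather than a specific Xerces 2.4.0 jar. The class name
CdataProbe and the element name "r" are invented for illustration. The
document is fed to the parser as a character stream, so byte-level decoding
errors such as java.io.UTFDataFormatException should not come into play here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: reports whether one code point parses inside CDATA.
final class CdataProbe {
    private CdataProbe() {}

    static boolean parsesInCdata(int cp) {
        String ch = new String(Character.toChars(cp));
        String doc = "<r><![CDATA[" + ch + "]]></r>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(doc)),
                           new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            return false;
        }
    }
}
```

Calling CdataProbe.parsesInCdata(0xC) is expected to return false, while an
ordinary letter such as 'A' returns true. Looping over a range of code points
and collecting the failures would reproduce the kind of survey described.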

Is this the desired behaviour? And if so, can anyone recommend a technique
for transforming data retrieved from a relational database table (that may
contain these unusual characters) in such a way that it can safely be
encoded into an XML document without raising an exception?

Thanks,
Tom Sugden

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
