On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
Hello,
While trying to open an XMI file in the XML view after processing, an
error pops up telling me that there is an invalid XML character.
The error comes from the SAX parser. Below is the stack trace. Thanks
very much for your help,
Leo,
Hmm, looks like we have a bug here...
Most control characters are not allowed in XML 1.0, even when they are
escaped as character references (&#xxx;). If your input document
contains such characters, the XMI CAS serializer writes them to the
output XMI document, making it unreadable.
One workaround might be for you to strip control characters from your
input documents. This test should return true for valid XML
characters and false for invalid ones:
(c >= 0x20 && c < 0xFFFE) || c == 0x09 || c == 0x0A || c == 0x0D
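
A minimal sketch of how that stripping step might look in Java. The
class and method names here (XmlCharFilter, isValidXml10, strip) are
made up for illustration and are not part of UIMA. It walks the string
by code point and uses the full XML 1.0 character ranges, so it also
drops unpaired surrogates, which the one-line test above lets through:

```java
// Hypothetical helper, not a UIMA class.
final class XmlCharFilter {

    /** True if the code point is a legal character in XML 1.0. */
    static boolean isValidXml10(int cp) {
        return cp == 0x09 || cp == 0x0A || cp == 0x0D
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with all invalid code points removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Keep in mind that removing characters shortens the text, so this has to
happen before the document is processed, not afterwards.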
I also checked that if you edit the XMI document and change the first
line to:
<?xml version="1.1" encoding="UTF-8"?>
the problem goes away, because XML version 1.1 does allow escaped
control characters.
So one possibility for us to fix this in UIMA is to have the XMI CAS
serializer declare XML version 1.1 by default. (I think we considered
that before and decided against it for some reason; maybe we were
worried that other applications might not be able to consume XML
1.1? I can't remember. :)
Another possibility would be to have the XMI serializer automatically
replace these characters with spaces. The XCAS (not XMI) serializer
does that, but only for the document text, not for feature values.
Note that even XML 1.1 does not allow the 0x00 character, escaped or
not, so switching versions alone would not cover that case.
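
For illustration, replacing with spaces could be sketched roughly like
this, reusing the character test given earlier. The names
(XmlCharReplacer, replaceInvalid) are hypothetical and this is not the
actual XCAS serializer code:

```java
// Hypothetical helper, not the actual XCAS or XMI serializer code.
final class XmlCharReplacer {

    /** The per-char validity test from earlier in this message. */
    static boolean isValidXmlChar(char c) {
        return (c >= 0x20 && c < 0xFFFE) || c == 0x09 || c == 0x0A || c == 0x0D;
    }

    /** Replaces each invalid char with a space; the length is unchanged. */
    static String replaceInvalid(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (!isValidXmlChar(chars[i])) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }
}
```

Unlike stripping, a one-for-one replacement keeps the string the same
length, so character positions in the text stay where they were.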
-Adam