[
https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492956
]
Adam Lally commented on UIMA-387:
---------------------------------
There are several websites out there discouraging the use of XML 1.1. For
example: http://www.cafeconleche.org/books/effectivexml/chapters/03.html. I'm
no XML expert but these make me wonder if switching to 1.1 will cause some
application to have difficulty using our XML output. Perhaps, though, it's the
least of the evils here.
EMF throws an exception during serialization if you try to serialize a String
containing an invalid XML character. There's an option you can set to tell it
to serialize to XML 1.1, in which case it properly escapes the character
and doesn't throw an exception. UIMA supports that too: you can call
XMLSerializer.setOutputProperty(OutputKeys.VERSION, "1.1") to tell the
serializer to generate XML 1.1. But there's no way to turn that on via a
configuration parameter to the XmiWriterCasConsumer or through GUIs like the
DocumentAnalyzer. So one possibility is just to expose that switch.
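To make "properly escape the character" concrete, here is a minimal sketch of what escaping for XML 1.1 involves, using only the JDK. It is not the EMF or UIMA implementation; the class Xml11Escaper and its methods are hypothetical names made up for this illustration. It assumes the XML 1.1 rule that the restricted control characters may appear only as numeric character references, and that U+0000 is not allowed at all.

```java
// Hypothetical helper for illustration only; not part of UIMA or EMF.
final class Xml11Escaper {
    private Xml11Escaper() {}

    // XML 1.1 "RestrictedChar": legal only when written as a character reference.
    static boolean isRestricted(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || cp == 0xB || cp == 0xC
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    // Escapes markup characters and restricted control characters.
    // Simplification: unpaired surrogates, U+FFFE and U+FFFF are not checked.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0) {
                // Even XML 1.1 has no way to represent the null character.
                throw new IllegalArgumentException("U+0000 is not allowed in XML 1.1");
            } else if (isRestricted(cp)) {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

For example, Xml11Escaper.escape applied to a string holding 'a', U+0001, 'b' yields a&#x1;b, which an XML 1.1 parser accepts and an XML 1.0 parser rejects.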
I think it is reasonable to disallow nulls (the 0x00 character). Since we're
using the XMI standard, I don't think we can get into defining our own special
ways around the limitations of XML. That would make our XMI output not
consumable by other systems such as EMF.
> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
> Key: UIMA-387
> URL: https://issues.apache.org/jira/browse/UIMA-387
> Project: UIMA
> Issue Type: Bug
> Components: Core Java Framework
> Affects Versions: 2.1
> Reporter: Adam Lally
> Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid XML character.
> > The error comes from the SAX parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped as character references (&#xxx;). If your input document contains
> such characters, the XMI CAS serializer writes them to the output XMI
> document, making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate an XML version 1.1 declaration by default. (I think we
> considered that before and decided not to for some reason; maybe we
> were worried that other applications might not be able to consume XML
> 1.1? I can't remember. :) )
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces. The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values. We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).
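The replace-with-spaces option described above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the XCAS serializer's actual code; Xml10Sanitizer and its methods are hypothetical names. It assumes the XML 1.0 definition of a legal character: tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF.

```java
// Hypothetical helper for illustration only; not the XCAS or XMI serializer code.
final class Xml10Sanitizer {
    private Xml10Sanitizer() {}

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every character that is illegal in XML 1.0 with a space.
    // Every illegal code point occupies exactly one Java char, so the
    // result has the same length as the input and offsets stay valid.
    static String replaceInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
        }
        return out.toString();
    }
}
```

Because the replacement keeps the string length unchanged, annotation begin and end offsets into the document text would still line up after sanitizing. The same method could be applied to String feature values as well as to the document text.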